How can i proof my results after mine some dataset? - machine-learning

I wonder if there´s anyway to proof the correctness of my results after apply some data mining algorithms to a set of data. When i say data mining algorithms im talking about the basic algorithms

If you have many examples, a simple way is to split available data in three partitions:
training data (around 50%-60% of available examples, randomly chosen);
validation data (20%-25%);
test data (20%-25%).
Training data are used to adjust parameters of the data mining algorithms.
With validation data you can compare models/algorithms/parameters and choose a winner.
Test data can give you a forecast of winner's performance in the "real world" because they are independent (during the training/validation phase you don't make any choice based on test data).
Anyway there are many schemes and probably the best place to delve deeper into the matter is http://stats.stackexchange.com

There can be several ways to proof correctness of your results. Firstly, you have to choose performance criteria
Accuracy of algorithm
Standard Deviation of results
Computation time
Based on either of these criteria, you have to adopt different-different mechanism to prove correctness of your algorithm.
1. Accuracy of algorithm
for this you have to understand, what are those point which can be questioned when you say that my algorithm's accuracy is XY.WZ%.
First question, is your algorithm giving better result because of over-fitting?
To avoid over-fitting by your algorithm, you can divide your data into three parts
training data
validation data
testing data
by doing so, if you are get good testing results, you can be sure that your algorithm did not over-fit. if there is a big difference between training and testing accuracy that is a sign of over-fitting.
What if you find out that your algorithm over-fit?
You can use several regularization techniques that keeps value of weights coefficient lower and helps in preventing over-fitting. You can know more about this in lectures of machine learning by Andre N.G at coursra.
Second question, is your data-set fairly chosen?
Suppose you have 100 dataset and you divided it in 50-30-20 set (training-validation-testing). Now question comes which 50 for training and which 30 dataset for validation and so on. So for different-2 selection of these data-set, you will get different-2 accuracy values. So, you should take 5-10 different-2 sets and then provide and average of results. This technique is known as cross-validation technique.
An another way to prove correctness of your algorithm is to provide confusion matrix in case of muticlass classification and sensitivity and specificity in case of binary classification. you can look at their wiki pages.
2. Standard deviation of results
If your algorithm is based on random population generation or based on heuristics then you are most likely to get different solution at each run of algorithm . In this case, you should provide an standard deviation of multiple runs on same data-set and same parameter setting by your algorithm.
3. computation time of algorithm
This might not be important in every case but if you are doing an comparison of your algorithm with other algorithm then you should provide comparison of computation time, however this has nothing to do with correctness of your algorithm but it does gives an idea of comprehensiveness of your algorithm.

What good are proven results?
At most you will be able to prove that your implementation matches some theoretical mathematical model, or that an approximative algorithm approximates this mathematical model.
But in practise, real data will not satisfy your mathematical assumptions anyway.
Often, the best proof is: does it work?
That is, on real, unseen data. Not on the data that you used to choose your parameters, because then you are prone to overfitting.

Related

Why KNN implementation in weka runs faster?

1) As we know KNN perform no computation in training phase instead defer all computations for classification because of which we call it lazy learner. It should take more time in classification than training however i found this assumption almost opposite about weka. In which KNN take more time in training than testing.
Why and how KNN in weka perform much faster in classification whereas in general it should perform slower ?
Does it also result in computations mistake ?
2) When we say feature weighting in Knn may improve performance for high dimensional data, what do we mean by saying it ? Do we mean that feature selection and selecting feature with high InformationGain ?
Answer to question 1
My guess is that the Weka implementation uses some kind of data structure to efficiently perform (approximate) nearest-neighbor queries.
Using such data structures, queries can be performed much more efficiently than performing them in the naive way.
Examples of such data structures are the KD tree and the SR Tree.
In the training phase the data structure has to be created, so it will take more time than classification.
Answer to question 2
(I'm not sure if you refer to predictive performance or performance as in speed-up. Since both are relevant, I will address them both in my answer.)
Using higher weights for the most relevant features and lower weights for the less relevant features may improve the predictive performance.
Another way to improve predictive performance is to perform feature selection. Using Mutual Information or some other kind of univariate association (like Pearson correlation for continuous variables) is the simplest and easiest way to perform feature selection. Note that reducing the number of variables can offer a significant speed-up in terms of computational time.
Of course, you can do both, that is, perform feature selection first and then use weights on the remaining features. You could for example use the mutual information to weight the remaining features. In case of text-classification, you could also use TF-IDF to weight your features.

What are the metrics to evaluate a machine learning algorithm

I would like to know what are the various techniques and metrics used to evaluate how accurate/good an algorithm is and how to use a given metric to derive a conclusion about a ML model.
one way to do this is to use precision and recall, as defined here in wikipedia.
Another way is to use the accuracy metric as explained here. So, what I would like to know is whether there are other metrics for evaluating an ML model?
I've compiled, a while ago, a list of metrics used to evaluate classification and regression algorithms, under the form of a cheatsheet. Some metrics for classification: precision, recall, sensitivity, specificity, F-measure, Matthews correlation, etc. They are all based on the confusion matrix. Others exist for regression (continuous output variable).
The technique is mostly to run an algorithm on some data to get a model, and then apply that model on new, previously unseen data, and evaluate the metric on that data set, and repeat.
Some techniques (actually resampling techniques from statistics):
Jacknife
Crossvalidation
K-fold validation
bootstrap.
Talking about ML in general is a quite vast field, but I'll try to answer any way. The Wikipedia definition of ML is the following
Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.
In this context learning can be defined parameterization of an algorithm. The parameters of the algorithm are derived using input data with a known output. When the algorithm has "learned" the association between input and output, it can be tested with further input data for which the output is well known.
Let's suppose your problem is to obtain words from speech. Here the input is some kind of audio file containing one word (not necessarily, but I supposed this case to keep it quite simple). You'd record X words N times and then use (for example) N/2 of the repetitions to parameterize your algorithm, disregarding - at the moment - how your algorithm would look like.
Now on the one hand - depending on the algorithm - if you feed your algorithm with one of the remaining repetitions, it may give you some certainty estimate which may be used to characterize the recognition of just one of the repetitions. On the other hand you may use all of the remaining repetitions to test the learned algorithm. For each of the repetitions you pass it to the algorithm and compare the expected output with the actual output. After all you'll have an accuracy value for the learned algorithm calculated as the quotient of correct and total classifications.
Anyway, the actual accuracy will depend on the quality of your learning and test data.
A good start to read on would be Pattern Recognition and Machine Learning by Christopher M Bishop
There are various metrics for evaluating the performance of ML model and there is no rule that there are 20 or 30 metrics only. You can create your own metrics depending on your problem. There are various cases wherein when you are solving real - world problem where you would need to create your own custom metrics.
Coming to the existing ones, it is already listed in the first answer, I would just highlight each metrics merits and demerits to better have an understanding.
Accuracy is the simplest of the metric and it is commonly used. It is the number of points to class 1/ total number of points in your dataset. This is for 2 class problem where some points belong to class 1 and some to belong to class 2. It is not preferred when the dataset is imbalanced because it is biased to balanced one and it is not that much interpretable.
Log loss is a metric that helps to achieve probability scores that gives you better understanding why a specific point is belonging to class 1. The best part of this metric is that it is inbuild in logistic regression which is famous ML technique.
Confusion metric is best used for 2-class classification problem which gives four numbers and the diagonal numbers helps to get an idea of how good is your model.Through this metric there are others such as precision, recall and f1-score which are interpretable.

Machine Learning Algorithm selection

I am new in machine learning. My problem is to make a machine to select a university for the student according to his location and area of interest. i.e it should select the university in the same city as in the address of the student. I am confused in selection of the algorithm can I use Perceptron algorithm for this task.
There are no hard rules as to which machine learning algorithm is the best for which task. Your best bet is to try several and see which one achieves the best results. You can use the Weka toolkit, which implements a lot of different machine learning algorithms. And yes, you can use the perceptron algorithm for your problem -- but that is not to say that you would achieve good results with it.
From your description it sounds like the problem you're trying to solve doesn't really require machine learning. If all you want to do is match a student with the closest university that offers a course in the student's area of interest, you can do this without any learning.
I second the first remark that you probably don't need machine learning if the student has to live in the same area as the university. If you want to use an ML algorithm, maybe it would best to think about what data you would have to start with. The thing that comes to mind is a vector for a university that has certain subjects/areas for each feature. Then compute a distance from a vector which is like an ideal feature vector for the student. Minimize this distance.
The first and formost thing you need is a labeled dataset.
It sounds like the problem could be decomposed into a ML problem however you first need a set of positive and negative examples to train from.
How big is your dataset? What features do you have available? Once you answer these questions you can select an algorithm that bests fits the features of your data.
I would suggest using decision trees for this problem which resembles a set of if else rules. You can just take the location and area of interest of the student as conditions of if and else if statements and then suggest a university for him. Since its a direct mapping of inputs to outputs, rule based solution would work and there is no learning required here.
Maybe you can use a "recommender system"or a clustering approach , you can investigate more deeply the techniques like "collaborative filtering"(recommender system) or k-means(clustering) but again, as some people said, first you need data to learn from, and maybe your problem can be solved without ML.
Well, there is no straightforward and sure-shot answer to this question. The answer depends on many factors like the problem statement and the kind of output you want, type and size of the data, the available computational time, number of features, and observations in the data, to name a few.
Size of the training data
Accuracy and/or Interpretability of the output
Accuracy of a model means that the function predicts a response value for a given observation, which is close to the true response value for that observation. A highly interpretable algorithm (restrictive models like Linear Regression) means that one can easily understand how any individual predictor is associated with the response while the flexible models give higher accuracy at the cost of low interpretability.
Speed or Training time
Higher accuracy typically means higher training time. Also, algorithms require more time to train on large training data. In real-world applications, the choice of algorithm is driven by these two factors predominantly.
Algorithms like Naïve Bayes and Linear and Logistic regression are easy to implement and quick to run. Algorithms like SVM, which involve tuning of parameters, Neural networks with high convergence time, and random forests, need a lot of time to train the data.
Linearity
Many algorithms work on the assumption that classes can be separated by a straight line (or its higher-dimensional analog). Examples include logistic regression and support vector machines. Linear regression algorithms assume that data trends follow a straight line. If the data is linear, then these algorithms perform quite good.
Number of features
The dataset may have a large number of features that may not all be relevant and significant. For a certain type of data, such as genetics or textual, the number of features can be very large compared to the number of data points.

Which machine learning classifier to choose, in general? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
Suppose I'm working on some classification problem. (Fraud detection and comment spam are two problems I'm working on right now, but I'm curious about any classification task in general.)
How do I know which classifier I should use?
Decision tree
SVM
Bayesian
Neural network
K-nearest neighbors
Q-learning
Genetic algorithm
Markov decision processes
Convolutional neural networks
Linear regression or logistic regression
Boosting, bagging, ensambling
Random hill climbing or simulated annealing
...
In which cases is one of these the "natural" first choice, and what are the principles for choosing that one?
Examples of the type of answers I'm looking for (from Manning et al.'s Introduction to Information Retrieval book):
a. If your data is labeled, but you only have a limited amount, you should use a classifier with high bias (for example, Naive Bayes).
I'm guessing this is because a higher-bias classifier will have lower variance, which is good because of the small amount of data.
b. If you have a ton of data, then the classifier doesn't really matter so much, so you should probably just choose a classifier with good scalability.
What are other guidelines? Even answers like "if you'll have to explain your model to some upper management person, then maybe you should use a decision tree, since the decision rules are fairly transparent" are good. I care less about implementation/library issues, though.
Also, for a somewhat separate question, besides standard Bayesian classifiers, are there 'standard state-of-the-art' methods for comment spam detection (as opposed to email spam)?
First of all, you need to identify your problem. It depends upon what kind of data you have and what your desired task is.
If you are Predicting Category :
You have Labeled Data
You need to follow Classification Approach and its algorithms
You don't have Labeled Data
You need to go for Clustering Approach
If you are Predicting Quantity :
You need to go for Regression Approach
Otherwise
You can go for Dimensionality Reduction Approach
There are different algorithms within each approach mentioned above. The choice of a particular algorithm depends upon the size of the dataset.
Source: http://scikit-learn.org/stable/tutorial/machine_learning_map/
Model selection using cross validation may be what you need.
Cross validation
What you do is simply to split your dataset into k non-overlapping subsets (folds), train a model using k-1 folds and predict its performance using the fold you left out. This you do for each possible combination of folds (first leave 1st fold out, then 2nd, ... , then kth, and train with the remaining folds). After finishing, you estimate the mean performance of all folds (maybe also the variance/standard deviation of the performance).
How to choose the parameter k depends on the time you have. Usual values for k are 3, 5, 10 or even N, where N is the size of your data (that's the same as leave-one-out cross validation). I prefer 5 or 10.
Model selection
Let's say you have 5 methods (ANN, SVM, KNN, etc) and 10 parameter combinations for each method (depending on the method). You simply have to run cross validation for each method and parameter combination (5 * 10 = 50) and select the best model, method and parameters. Then you re-train with the best method and parameters on all your data and you have your final model.
There are some more things to say. If, for example, you use a lot of methods and parameter combinations for each, it's very likely you will overfit. In cases like these, you have to use nested cross validation.
Nested cross validation
In nested cross validation, you perform cross validation on the model selection algorithm.
Again, you first split your data into k folds. After each step, you choose k-1 as your training data and the remaining one as your test data. Then you run model selection (the procedure I explained above) for each possible combination of those k folds. After finishing this, you will have k models, one for each combination of folds. After that, you test each model with the remaining test data and choose the best one. Again, after having the last model you train a new one with the same method and parameters on all the data you have. That's your final model.
Of course, there are many variations of these methods and other things I didn't mention. If you need more information about these look for some publications about these topics.
The book "OpenCV" has a great two pages on this on pages 462-463. Searching the Amazon preview for the word "discriminative" (probably google books also) will let you see the pages in question. These two pages are the greatest gem I have found in this book.
In short:
Boosting - often effective when a large amount of training data is available.
Random trees - often very effective and can also perform regression.
K-nearest neighbors - simplest thing you can do, often effective but slow and requires lots of memory.
Neural networks - Slow to train but very fast to run, still optimal performer for letter recognition.
SVM - Among the best with limited data, but losing against boosting or random trees only when large data sets are available.
Things you might consider in choosing which algorithm to use would include:
Do you need to train incrementally (as opposed to batched)?
If you need to update your classifier with new data frequently (or you have tons of data), you'll probably want to use Bayesian. Neural nets and SVM need to work on the training data in one go.
Is your data composed of categorical only, or numeric only, or both?
I think Bayesian works best with categorical/binomial data. Decision trees can't predict numerical values.
Does you or your audience need to understand how the classifier works?
Use Bayesian or decision trees, since these can be easily explained to most people. Neural networks and SVM are "black boxes" in the sense that you can't really see how they are classifying data.
How much classification speed do you need?
SVM's are fast when it comes to classifying since they only need to determine which side of the "line" your data is on. Decision trees can be slow especially when they're complex (e.g. lots of branches).
Complexity.
Neural nets and SVMs can handle complex non-linear classification.
As Prof Andrew Ng often states: always begin by implementing a rough, dirty algorithm, and then iteratively refine it.
For classification, Naive Bayes is a good starter, as it has good performances, is highly scalable and can adapt to almost any kind of classification task. Also 1NN (K-Nearest Neighbours with only 1 neighbour) is a no-hassle best fit algorithm (because the data will be the model, and thus you don't have to care about the dimensionality fit of your decision boundary), the only issue is the computation cost (quadratic because you need to compute the distance matrix, so it may not be a good fit for high dimensional data).
Another good starter algorithm is the Random Forests (composed of decision trees), this is highly scalable to any number of dimensions and has generally quite acceptable performances. Then finally, there are genetic algorithms, which scale admirably well to any dimension and any data with minimal knowledge of the data itself, with the most minimal and simplest implementation being the microbial genetic algorithm (only one line of C code! by Inman Harvey in 1996), and one of the most complex being CMA-ES and MOGA/e-MOEA.
And remember that, often, you can't really know what will work best on your data before you try the algorithms for real.
As a side-note, if you want a theoretical framework to test your hypothesis and algorithms theoretical performances for a given problem, you can use the PAC (Probably approximately correct) learning framework (beware: it's very abstract and complex!), but to summary, the gist of PAC learning says that you should use the less complex, but complex enough (complexity being the maximum dimensionality that the algo can fit) algorithm that can fit your data. In other words, use the Occam's razor.
Sam Roweis used to say that you should try naive Bayes, logistic regression, k-nearest neighbour and Fisher's linear discriminant before anything else.
My take on it is that you always run the basic classifiers first to get some sense of your data. More often than not (in my experience at least) they've been good enough.
So, if you have supervised data, train a Naive Bayes classifier. If you have unsupervised data, you can try k-means clustering.
Another resource is one of the lecture videos of the series of videos Stanford Machine Learning, which I watched a while back. In video 4 or 5, I think, the lecturer discusses some generally accepted conventions when training classifiers, advantages/tradeoffs, etc.
You should always keep into account the inference vs prediction trade-off.
If you want to understand the complex relationship that is occurring in your data then you should go with a rich inference algorithm (e.g. linear regression or lasso). On the other hand, if you are only interested in the result you can go with high dimensional and more complex (but less interpretable) algorithms, like neural networks.
Selection of Algorithm is depending upon the scenario and the type and size of data set.
There are many other factors.
This is a brief cheat sheet for basic machine learning.

How to approach machine learning problems with high dimensional input space?

How should I approach a situtation when I try to apply some ML algorithm (classification, to be more specific, SVM in particular) over some high dimensional input, and the results I get are not quite satisfactory?
1, 2 or 3 dimensional data can be visualized, along with the algorithm's results, so you can get the hang of what's going on, and have some idea how to aproach the problem. Once the data is over 3 dimensions, other than intuitively playing around with the parameters I am not really sure how to attack it?
What do you do to the data? My answer: nothing. SVMs are designed to handle high-dimensional data. I'm working on a research problem right now that involves supervised classification using SVMs. Along with finding sources on the Internet, I did my own experiments on the impact of dimensionality reduction prior to classification. Preprocessing the features using PCA/LDA did not significantly increase classification accuracy of the SVM.
To me, this totally makes sense from the way SVMs work. Let x be an m-dimensional feature vector. Let y = Ax where y is in R^n and x is in R^m for n < m, i.e., y is x projected onto a space of lower dimension. If the classes Y1 and Y2 are linearly separable in R^n, then the corresponding classes X1 and X2 are linearly separable in R^m. Therefore, the original subspaces should be "at least" as separable as their projections onto lower dimensions, i.e., PCA should not help, in theory.
Here is one discussion that debates the use of PCA before SVM: link
What you can do is change your SVM parameters. For example, with libsvm link, the parameters C and gamma are crucially important to classification success. The libsvm faq, particularly this entry link, contains more helpful tips. Among them:
Scale your features before classification.
Try to obtain balanced classes. If impossible, then penalize one class more than the other. See more references on SVM imbalance.
Check the SVM parameters. Try many combinations to arrive at the best one.
Use the RBF kernel first. It almost always works best (computationally speaking).
Almost forgot... before testing, cross validate!
EDIT: Let me just add this "data point." I recently did another large-scale experiment using the SVM with PCA preprocessing on four exclusive data sets. PCA did not improve the classification results for any choice of reduced dimensionality. The original data with simple diagonal scaling (for each feature, subtract mean and divide by standard deviation) performed better. I'm not making any broad conclusion -- just sharing this one experiment. Maybe on different data, PCA can help.
Some suggestions:
Project data (just for visualization) to a lower-dimensional space (using PCA or MDS or whatever makes sense for your data)
Try to understand why learning fails. Do you think it overfits? Do you think you have enough data? Is it possible there isn't enough information in your features to solve the task you are trying to solve? There are ways to answer each of these questions without visualizing the data.
Also, if you tell us what the task is and what your SVM output is, there may be more specific suggestions people could make.
You can try reducing the dimensionality of the problem by PCA or the similar technique. Beware that PCA has two important points. (1) It assumes that the data it is applied to is normally distributed and (2) the resulting data looses its natural meaning (resulting in a blackbox). If you can live with that, try it.
Another option is to try several parameter selection algorithms. Since SVM's were already mentioned here, you might try the approach of Chang and Li (Feature Ranking Using Linear SVM) in which they used linear SVM to pre-select "interesting features" and then used RBF - based SVM on the selected features. If you are familiar with Orange, a python data mining library, you will be able to code this method in less than an hour. Note that this is a greedy approach which, due to its "greediness" might fail in cases where the input variables are highly correlated. In that case, and if you cannot solve this problem with PCA (see above), you might want to go to heuristic methods, which try to select best possible combinations of predictors. The main pitfall of this kind of approaches is the high potential of overfitting. Make sure you have a bunch "virgin" data that was not seen during the entire process of model building. Test your model on that data only once, after you are sure that the model is ready. If you fail, don't use this data once more to validate another model, you will have to find a new data set. Otherwise you won't be sure that you didn't overfit once more.
List of selected papers on parameter selection:
Feature selection for high-dimensional genomic microarray data
Oh, and one more thing about SVM. SVM is a black box. You better figure out what is the mechanism that generate the data and model the mechanism and not the data. On the other hand, if this would be possible, most probably you wouldn't be here asking this question (and I wouldn't be so bitter about overfitting).
List of selected papers on parameter selection
Feature selection for high-dimensional genomic microarray data
Wrappers for feature subset selection
Parameter selection in particle swarm optimization
I worked in the laboratory that developed this Stochastic method to determine, in silico, the drug like character of molecules
I would approach the problem as follows:
What do you mean by "the results I get are not quite satisfactory"?
If the classification rate on the training data is unsatisfactory, it implies that either
You have outliers in your training data (data that is misclassified). In this case you can try algorithms such as RANSAC to deal with it.
Your model(SVM in this case) is not well suited for this problem. This can be diagnozed by trying other models (adaboost etc.) or adding more parameters to your current model.
The representation of the data is not well suited for your classification task. In this case preprocessing the data with feature selection or dimensionality reduction techniques would help
If the classification rate on the test data is unsatisfactory, it implies that your model overfits the data:
Either your model is too complex(too many parameters) and it needs to be constrained further,
Or you trained it on a training set which is too small and you need more data
Of course it may be a mixture of the above elements. These are all "blind" methods to attack the problem. In order to gain more insight into the problem you may use visualization methods by projecting the data into lower dimensions or look for models which are suited better to the problem domain as you understand it (for example if you know the data is normally distributed you can use GMMs to model the data ...)
If I'm not wrong, you are trying to see which parameters to the SVM gives you the best result. Your problem is model/curve fitting.
I worked on a similar problem couple of years ago. There are tons of libraries and algos to do the same. I used Newton-Raphson's algorithm and a variation of genetic algorithm to fit the curve.
Generate/guess/get the result you are hoping for, through real world experiment (or if you are doing simple classification, just do it yourself). Compare this with the output of your SVM. The algos I mentioned earlier reiterates this process till the result of your model(SVM in this case) somewhat matches the expected values (note that this process would take some time based your problem/data size.. it took about 2 months for me on a 140 node beowulf cluster).
If you choose to go with Newton-Raphson's, this might be a good place to start.

Resources