I am fairly new to machine learning and have recently been working on a classification problem; the link to the dataset is below. Since cars interest me, I decided to go with a dataset that deals with classifying cars based on several attributes.
http://archive.ics.uci.edu/ml/datasets/Car+Evaluation
Now, I understand that there might be a number of ways to go about this particular case, but the real question is: which algorithm is likely to be most effective?
I am considering Regression, SVM, KNN, and Hidden Markov Models. Any suggestions at all would be greatly appreciated.
You have a multi-class classification problem with 1728 samples. The features are in 6 groups:
buying: v-high, high, med, low
maint: v-high, high, med, low
doors: 2, 3, 4, 5-more
persons: 2, 4, more
lug_boot: small, med, big
safety: low, med, high
What you need to do for the features is one-hot encode them, i.e. create binary features like
buying_v-high, buying_high, buying_med, buying_low, maint_v-high, ...
In the end you'll have
4 + 4 + 4 + 3 + 3 + 3 = 21
features (a short sketch of this encoding follows the class table below). The output classes are:
class       N     N [%]
-----------------------------
unacc    1210    70.023
acc       384    22.222
good       69     3.993
v-good     65     3.762
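A minimal sketch of that encoding, assuming the data has been read into a pandas DataFrame; the column names follow the UCI attribute description and "car.data" is the file name used on the UCI page:

import pandas as pd

# Column names follow the UCI attribute description; the raw file has no header row.
cols = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
df = pd.read_csv("car.data", names=cols)

# One-hot encode the six categorical attributes: 4+4+4+3+3+3 = 21 binary features.
X = pd.get_dummies(df.drop(columns="class"))
y = df["class"]

print(X.shape)  # (1728, 21)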
You need to try several classification algorithms to see which one works best. For evaluation you can use cross-validation, or you can hold out, say, 728 of the samples and evaluate on those.
For the classification models, iterate over, say, 10 different classifiers available in machine-learning libraries and check which one does better. I suggest using scikit-learn for simplicity.
You can find a simple iterator over several classifiers in this script.
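For illustration, a minimal sketch of such a loop, assuming X and y are the encoded features and labels from above; the classifier selection is just an example, not the contents of the linked script:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-NN": KNeighborsClassifier(),
    "SVM (RBF)": SVC(),
    "decision tree": DecisionTreeClassifier(),
    "random forest": RandomForestClassifier(),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation accuracy
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")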
Remember that you need to tune some parameters for each model, and you shouldn't tune them on the test set. So it is better to split your samples into 1000 (training set), 350 (development set), and 378 (test set). Use the development set to tune your parameters and to choose the best-performing model, and then use the test set to evaluate that model on unseen data.
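A sketch of that split with scikit-learn, again assuming the X and y from above; the sizes 1000/350/378 are the ones suggested here:

from sklearn.model_selection import train_test_split

# First carve off the 378-sample test set, then split the rest into train/dev.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=378, stratify=y, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=350, stratify=y_rest, random_state=0)

print(len(X_train), len(X_dev), len(X_test))  # 1000 350 378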
I am a medical doctor trying to make prediction models based on a database of approximately 1500 patients with 60+ parameters each.
I am dealing with a classification problem (mortality at 1, 3, 6, and 12 months). I have made stratified splits (70 % training / 30 % testing) and done feature selection with the Boruta algorithm before training random forest, GLM, and eXtreme Gradient Boosting models for each timepoint.
The AUC for all models is about 0.80 (the RF models slightly better), with Brier scores between 0.09 and 0.17 for the RF and between 0.13 and 0.23 for the other two.
So based on the Brier scores it seems that the RF models have a slight advantage, but I am wondering:
-Should I do more performance measurements? Which ones and why?
-How should I interpret my results? My understanding is that there is a roughly linear association between the predictors and the outcome, since the GLM performs well; still, the RF has a slight performance advantage, with the disadvantage of being a more "complicated" model.
I plan to do external validation with a different dataset, but for now I would be very interested to know whether other measurements could shed some light on the advantages of the different models. I am sure I am missing something, as I am new to the field, and I would be glad to hear any advice or opinions.
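For reference, a minimal sketch of how the two metrics mentioned above can be computed with scikit-learn; y_true and p_pred are placeholder names for the observed labels and one model's predicted probabilities:

import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

# y_true: observed 0/1 mortality labels, p_pred: predicted probabilities of death
# from one of the models (placeholder values here, just so the snippet runs).
y_true = np.array([0, 0, 1, 1, 0, 1])
p_pred = np.array([0.1, 0.4, 0.8, 0.65, 0.2, 0.9])

print("AUC:  ", roc_auc_score(y_true, p_pred))     # discrimination
print("Brier:", brier_score_loss(y_true, p_pred))  # calibration + discrimination (lower is better)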
I realized that all the Monk's problems have a test set bigger than their training set.
Why is this dataset organized like this? I find it strange, even if it's a toy dataset for model comparison.
Monk1: 124 training samples, 432 test samples
Monk2: 169 training samples, 432 test samples
Monk3: 122 training samples, 432 test samples
From the machine-learning point of view, it absolutely doesn't matter how big the test set is. Why does it bother you? The real world looks exactly the same way: you have N labeled samples for training, but out there are N*10, N*1000, N*10^9 or more real cases, so any (manually labeled, fixed) test set will necessarily be too small. The goal is to have a representative set covering everything we expect in the real world, and if that means having a YUGE™ test set, then the best thing you can do is to have a test set larger than the training set.
In this particular case (and I'm not familiar with this particular task), the website you cited reads:
There are three MONK's problems. The domains for all MONK's problems are the same (described below). One of the MONK's problems has noise added. For each problem, the domain has been partitioned into a train and test set.
The paper linked there,
Wnek, J. and Michalski, R.S., "Comparing Symbolic and Subsymbolic Learning: Three Studies," in Machine Learning: A Multistrategy Approach, Vol. 4, R.S. Michalski and G. Tecuci (Eds.), Morgan Kaufmann, San Mateo, CA, 1993,
explains on page 20 that the authors chose different training conditions for the three problems, hence the three different training sets. According to
Leondes, Cornelius T., Image Processing and Pattern Recognition, Vol. 5, Elsevier, 1998, p. 307,
the test set contains all 432 available samples (the complete domain), while training used a subset of this data.
Having an overlap between training and test data is considered bad practice, but who am I to judge research from 25 years ago in a field I'm not familiar with. Maybe it was too difficult to obtain more data and keep a clean split.
I have a group of 20 yes/no/na questions that my company uses to assess whether or not to bid for an opportunity. To date, we have filled out the questionnaire 634 times.
The current algorithm simply computes yes / (yes + no), and a score over 50% recommends that we pursue the opportunity; n/a answers are disregarded.
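For concreteness, a minimal sketch of that scoring rule; the function name and answer encoding are just illustrative:

def should_pursue(answers):
    # answers: a list of "yes" / "no" / "na" strings, one per question
    yes = answers.count("yes")
    no = answers.count("no")
    return yes / (yes + no) > 0.5  # "na" answers are disregarded

print(should_pursue(["yes", "no", "na", "yes"]))  # True: 2 / (2 + 1) > 0.5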
We have tracked win/loss data on all of the pursuits, so I have a labeled dataset and I'm considering a supervised machine learning algorithm to replace our crude yes/no calculation.
I'm looking for a suggested supervised machine-learning method in Python (I'm most familiar with scikit-learn). A decision tree classifier?
Thank you in advance.
You have 20 yes/no answers as features. Let yes be 1 and no be 0, so there are 20 binary features.
You also have the target variable (win/loss). Let win be 1 and loss be 0. You can use an SVM or a neural network right away; in my experience SVM and logistic regression give similar accuracies.
But if you want to explain each feature's contribution to the decision, you should use Naive Bayes or decision trees.
It is also important to know who is giving the yeses and nos: if you have 10 experts answering those 20 questions with yes/no/na, you have 10 x 20 x 3 = 600 binary features, i.e. 60 features per expert.
Besides, you can use features from the project itself, such as whether the project is from the oil industry, mining, manufacturing, etc. Some experts might be better at prediction in one industry than in others.
For classification, you can try random forests from sklearn.
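A minimal sketch, assuming the 634 questionnaires are encoded as a 634 x 20 array of 0/1 answers (n/a would need its own encoding) and win/loss is a 0/1 label; the data below is a random placeholder:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X: 634 x 20 matrix of 0/1 answers, y: 634 win(1)/loss(0) labels (placeholders).
X = np.random.randint(0, 2, size=(634, 20))
y = np.random.randint(0, 2, size=634)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())  # cross-validated accuracy

clf.fit(X, y)
print(clf.feature_importances_)  # rough per-question contribution to the decision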
Note that instead of classification, you can turn the problem into a regression task by labelling each sample with the amount of profit or loss the company achieved from pursuing the project (negative or positive), or 0 for disregarded projects.
Hope this helps.
I have a dataset with 940 attributes and 450 instances, and I'm trying to find the classifier that gives the best results.
I have used every classifier that WEKA suggests (such as J48, CostSensitiveClassifier, combinations of several classifiers, etc.).
The best result I have found is a J48 tree with an accuracy of 91.7778 %,
and the confusion matrix is:
   a   b   <-- classified as
 394  27 |  a = NON_C
  10  19 |  b = C
I want to get better results in the confusion matrix: at least 90 % of each class (both NON_C and C) classified correctly.
Is there anything I can do to improve this (such as classifiers that run longer and scan all options), or another idea I haven't thought of?
Here is the file:
https://googledrive.com/host/0B2HGuYghQl0nWVVtd3BZb2Qtekk/
Please help!!
I'd guess that you got a data set and just tried all possible algorithms...
Usually, it is good to think about the problem first:
- Find and work only with relevant features (attributes); otherwise the task can be noisy. Relevant features = features that have a high correlation with the class (NON_C, C). A sketch of this kind of filtering follows the list below.
- Your dataset is imbalanced, i.e. the number of NON_C examples is much higher than the number of C examples. Sometimes it helps to train your algorithm on equal portions of positive and negative (in your case C and NON_C) examples, and cross-validate it on the natural (real) proportions.
- The size of your training data is small in comparison with the number of features. Maybe increasing the number of instances would help.
...
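A minimal sketch of the first point, assuming the data could be exported from WEKA and loaded into Python as a 450 x 940 matrix of non-negative feature values; SelectKBest with the chi-squared score is just one possible relevance filter, and the data below is a random placeholder:

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# X: 450 x 940 feature matrix (non-negative values assumed for chi2), y: NON_C/C labels.
X = np.abs(np.random.randn(450, 940))           # placeholder data
y = np.random.choice(["NON_C", "C"], size=450)  # placeholder labels

selector = SelectKBest(chi2, k=50)  # keep the 50 features most associated with the class
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (450, 50)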
There are quite a few things you can do to improve the classification results.
First, it seems that your training data is severely imbalanced. By training with that imbalance you are creating a significant bias in almost any classification algorithm.
Second, you have more features than examples. Consider using L1 and/or L2 regularization to improve the quality of your results.
Third, consider projecting your data into a lower-dimensional PCA space, say one containing 90 % of the variance. This will remove much of the noise in the training data (a sketch of this and the next point follows below).
Fourth, make sure you are training and testing on different portions of your data. From your description it seems like you are training and evaluating on the same data, which is a big no-no.
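A minimal sketch of the third and fourth points in scikit-learn, using a random placeholder matrix of the same shape as the data in the question; the 90 % variance threshold is the one suggested above:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X = np.random.randn(450, 940)                   # placeholder feature matrix
y = np.random.choice(["NON_C", "C"], size=450)  # placeholder labels

# Hold out a test set before doing anything else (fourth point).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Project onto the components explaining 90 % of the variance (third point),
# fitting the PCA on the training data only.
pca = PCA(n_components=0.90)
X_train_p = pca.fit_transform(X_train)
X_test_p = pca.transform(X_test)
print(X_train_p.shape)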
I have a binary class dataset (0 / 1) with a large skew towards the "0" class (about 30000 vs 1500). There are 7 features for each instance, no missing values.
When I use the J48 or any other tree classifier, I get almost all of the "1" instances misclassified as "0".
Setting the classifier to "unpruned", setting the minimum number of instances per leaf to 1, setting the confidence factor to 1, adding a dummy attribute with the instance ID number - none of this helped.
I just can't create a model that overfits my data!
I've also tried almost all of the other classifiers Weka provides, but got similar results.
Using IB1 gets 100% accuracy (trainset on trainset) so it's not a problem of multiple instances with the same feature values and different classes.
How can I create a completely unpruned tree?
Or otherwise force Weka to overfit my data?
Thanks.
Update: Okay, this is absurd. I've used only about 3100 negative and 1200 positive examples, and this is the tree I got (unpruned!):
J48 unpruned tree
------------------
F <= 0.90747: 1 (201.0/54.0)
F > 0.90747: 0 (4153.0/1062.0)
Needless to say, IB1 still gives 100% precision.
Update 2: Don't know how I missed it - unpruned SimpleCart works and gives 100% accuracy train-on-train; pruned SimpleCart is not as biased as J48 and has decent false positive and false negative rates.
Weka contains two meta-classifiers of interest:
weka.classifiers.meta.CostSensitiveClassifier
weka.classifiers.meta.MetaCost
They allow you to make any algorithm cost-sensitive (not restricted to SVMs) and to specify a cost matrix (the penalty for each kind of error); you would give a higher penalty for misclassifying a 1 instance as 0 than for erroneously classifying a 0 as 1.
The result is that the algorithm will then try to minimize the expected misclassification cost, rather than simply predicting the most likely class.
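If you want the same idea outside WEKA, a rough scikit-learn analogue (not the MetaCost algorithm itself) is the class_weight parameter that many classifiers accept; the 20:1 ratio and the data below are just illustrative:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Penalize misclassifying a "1" instance 20x more heavily than misclassifying a "0".
X = np.random.randn(100, 7)            # placeholder: 7 features as in the question
y = np.random.randint(0, 2, size=100)  # placeholder 0/1 labels

clf = DecisionTreeClassifier(class_weight={0: 1, 1: 20})
clf.fit(X, y)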
The quick and dirty solution is to resample. Throw away all but 1500 of your majority-class ("0") examples and train on a balanced data set. I am pretty sure there is a resample component in Weka to do this.
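A minimal sketch of that down-sampling step in plain NumPy, using placeholder data with the same kind of skew as in the question:

import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: 30000 "0" examples and 1500 "1" examples with 7 features each.
X = rng.normal(size=(31500, 7))
y = np.array([0] * 30000 + [1] * 1500)

pos_idx = np.where(y == 1)[0]                       # 1500 minority examples
neg_idx = np.where(y == 0)[0]                       # 30000 majority examples
neg_keep = rng.choice(neg_idx, size=len(pos_idx), replace=False)

balanced_idx = np.concatenate([pos_idx, neg_keep])
X_bal, y_bal = X[balanced_idx], y[balanced_idx]     # balanced training set
print(X_bal.shape)  # (3000, 7)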
The other solution is to use a classifier with a different misclassification cost for each class. I'm pretty sure libSVM allows you to do this, and I know Weka can wrap libSVM. However, I haven't used Weka in a while, so I can't be of much practical help here.