Cleveland heart disease dataset - can’t describe the class - machine-learning

I’m using the Cleveland Heart Disease dataset from UCI for classification but i don’t understand the target attribute.
The dataset description says that the values go from 0 to 4 but the attribute description says:
0: < 50% coronary disease
1: > 50% coronary disease
I’d like to know how to interpret this, is this dataset meant to be a multiclass or a binary classification problem? And must i group values 1-4 to a single class (presence of disease)?

If you are working on imbalanced dataset, you should use re-sampling technique to get better results. In case of imbalanced datasets the classifier always "predicts" the most common class without performing any analysis of the features.
You should try SMOTE, it's synthesizing elements for the minority class, based on those that already exist. It works randomly picking a point from the minority class and computing the k-nearest neighbors for this point.
I also used cross validation K-fold method along with SMOTE, Cross validation assures that model gets the correct patterns from the data.
While measuring the performance of model, accuracy metric mislead, its shows high accuracy even though there are more False Positive. Use metric such as F1-score and MCC.
References :
https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets

It basically means that the presence of different heart diseases have been denoted by 1, 2, 3, 4 while the absence is simply denoted by 0. Now, most of the experiments that have been conducted on this dataset have been based on binary classification, i.e. presence(1, 2, 3, 4) vs absence(0). One reason for such behavior might the class imbalance problem(0 has about 160 sample and the rest 1, 2, 3 and 4 make up the other half) and small number of samples(only around 300 total samples). So, it makes sense to treat this data as binary classification problem instead of multi-class classification, given the constraints that we have.

is this dataset meant to be a multiclass or a binary classification problem?
Without changes, the dataset is ready to be used for a multi-class classification problem.
And must i group values 1-4 to a single class (presence of disease)?
Yes, you must, as long as you are interested in using the dataset for a binary classification problem.

Related

Can logistic and lineair regression produce a prediction on a scale?

I currently have a dataset of drawings, each drawing being represented by some features. Each feature (independent variable) is a continuous number. None of the drawings have a label as of yet, which is why I am planning to start a sort of questionaire with people. However, before I can correctly setup such questionaire, I should have an idea of what kind of labels I should use for my training data.
At first thought, I was thinking about letting people rate the drawings on a scale, for example from 1 to 5 with 1 being bad, 3 being average and 5 being good. Alternatively, I could also reduce the question to a simple good or bad question. The latter would mean I lose some valuable information, but the dependent variable could then be considered 'binary'.
Using the training data I then composed, I would need to have a machine learning algorithm (model) which given a drawing, predicts if the drawing is good or not. Ideally, I would have some way of tuning the strictness in this prediction. For example, the model could instead of simply predicting 'good' or 'bad', predict the likelyhood of a painting being good on a scale of 0 to 1. I could then say "Well, let's say all paintings which are 70% likely to be good, are considered as good". Another example would be that the model predicts the goodness using the same categorical values the people used to rate the drawing initially. So it would either predict the drawing being a 1, 2, 3, 4 or 5. Similar to my first example, I could then say "Well, all paintings which are rated at least a 4, are considered good paintings" and tune this threshhold to my liking.
After doing some research, I came up with logistic and linear regression being good candidates. However, if which of the two would be the best for my scenario? Equally important, how would I need to format my labels? Just simple 0's and 1's or a scale?
You could use a 1 vs all representation if you wanted to use a multi-class categorical classification:
Essentially, you train 1 classifier for every category you have (you have 10 categories, so you have 10 classifiers) and then each classifier is just trained to predict whether or not the category belongs to each specific class.
There are alternative ways to make multi-class logistic regression work that only require training a single model, such as by using categorical cross entropy, but given that you'd like to use ordinal data, a linear regression used as a regression model is likely more ideal. You'd predict a value between 1 and 10 and then just round to the nearest integer. This way you aren't penalizing close guesses as much as far away guesses.
what keeps you from using a logistic regression model. Due to a lack of better dataset I used the standard diabetes data. The target variable is an integer between 50 and 200. I normalised the data between [-1,1] such that I can use sigmoid as activation function. For the loss I decided to use
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import MaxPooling2D, Input, Convolution2D
import numpy as np
from sklearn import datasets
diabetes = datasets.load_diabetes()
x_train=diabetes.data
y_train=2*(diabetes.target-min(diabetes.target))/(max(diabetes.target)-min(diabetes.target))-1
inputs = tf.keras.Input(shape=(x_train.shape[1],))
outputs = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)(inputs)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(), # Optimizer
loss=tf.keras.losses.MSE,
metrics=['sparse_categorical_accuracy'])
history = model.fit(x_train, y_train,
batch_size=64,
epochs=300,
validation_data=(x_train, y_train))
You could also use a linear regression model. There you only need to replace the activation function by linear. However I think the squashing character, besides ensuring hat there is no rating larger 1 or smaller -1.
A last alternative would be to train pair-wise preference. The idea is to show the human two drawings and ask which one he likes more. Then build a binary model, e.g., logistic regression. This approach appears preferable to me as it is easier to answer for the human

Machine Learning - Huge Only positive text dataset

I have a dataset with thousand of sentences belonging to a subject. I would like to know what would be best to create a classifier that will predict a text as "True" or "False" depending on whether they talk about that subject or not.
I've been using solutions with Weka (basic classifiers) and Tensorflow (neural network approaches).
I use string to word vector to preprocess the data.
Since there are no negative samples, I deal with a single class. I've tried one-class classifier (libSVM in Weka) but the number of false positives is so high I cannot use it.
I also tried adding negative samples but when the text to predict does not fall in the negative space, the classifiers I've tried (NB, CNN,...) tend to predict it as a false positive. I guess it's because of the sheer amount of positive samples
I'm open to discard ML as the tool to predict the new incoming data if necessary
Thanks for any help
I have eventually added data for the negative class and build a Multilineal Naive Bayes classifier which is doing the job as expected.
(the size of the data added is around one million samples :) )
My answer is based on the assumption that that adding of at least 100 negative samples for author’s dataset with 1000 positive samples is acceptable for the author of the question, since I have no answer for my question about it to the author yet
Since this case with detecting of specific topic is looks like particular case of topics classification I would recommend using classification approach with the two simple classes 1 class – your topic and another – all other topics for beginning
I succeeded with the same approach for face recognition task – at the beginning I built model with one output neuron with high level of output for face detection and low if no face detected
Nevertheless such approach gave me too low accuracy – less than 80%
But when I tried using 2 output neurons – 1 class for face presence on image and another if no face detected on the image, then it gave me more than 90% accuracy for MLP, even without using of CNN
The key point here is using of SoftMax function for the output layer. It gives significant increase of accuracy. From my experience, it increased accuracy of the MNIST dataset even for MLP from 92% up to 97% for the same model
About dataset. Majority of classification algorithms with a trainer, at least from my experience are more efficient with equal quantity of samples for each class in a training data set. In fact, if I have for 1 class less than 10% of average quantity for other classes it makes model almost useless for the detection of this class. So if you have 1000 samples for your topic, then I suggest creating 1000 samples with as many different topics as possible
Alternatively, if you don’t want to create a such big set of negative samples for your dataset, you can create a smaller set of negative samples for your dataset and use batch training with a size of batch = 2x your negative sample quantity. In order to do so, split your positive samples in n chunks with the size of each chunk ~ negative samples quantity and when train your NN by N batches for each iteration of training process with chunk[i] of positive samples and all your negative samples for each batch. Just be aware, that lower accuracy will be the price for this trade-off
Also, you could consider creation of more generic detector of topics – figure out all possible topics which can present in texts which your model should analyze, for example – 10 topics and create a training dataset with 1000 samples per each topic. It also can give higher accuracy
One more point about the dataset. The best practice is to train your model only with part of a dataset, for example – 80% and use the rest 20% for cross-validation. This cross-validation of unknown previously data for model will give you a good estimation of your model accuracy in real life, not for the training data set and allows to avoid overfitting issues
About building of model. I like doing it by "from simple to complex" approach. So I would suggest starting from simple MLP with SoftMax output and dataset with 1000 positive and 1000 negative samples. After reaching 80%-90% accuracy you can consider using CNN for your model, and also I would suggest increasing training dataset quantity, because deep learning algorithms are more efficient with bigger dataset
For text data you can use Spy EM.
The basic idea is to combine your positive set with a whole bunch of random samples, some of which you hold out. You initially treat all the random documents as the negative class, and train a classifier with your positive samples and these negative samples.
Now some of those random samples will actually be positive, and you can conservatively relabel any documents that are scored higher than the lowest scoring held out true positive samples.
Then you iterate this process until it stablizes.

Machine Learning Experiment Design with Small Positive Sample Set in Sci-kit Learn

I am interested in any tips on how to train a set with a very limited positive set and a large negative set.
I have about 40 positive examples (quite lengthy articles about a particular topic), and about 19,000 negative samples (most drawn from the sci-kit learn newsgroups dataset). I also have about 1,000,000 tweets that I could work with.. negative about the topic I am trying to train on. Is the size of the negative set versus the positive going to negatively influence training a classifier?
I would like to use cross-validation in sci-kit learn. Do I need to break this into train / test-dev / test sets? Is know there are some pre-built libraries in sci-kit. Any implementation examples that you recommend or have used previously would be helpful.
Thanks!
The answer to your first question is yes, the amount by which it will affect your results depends on the algorithm. My advive would be to keep an eye on the class-based statistics such as recall and precision (found in classification_report).
For RandomForest() you can look at this thread which discusses
the sample weight parameter. In general sample_weight is what
you're looking for in scikit-learn.
For SVM's have a look at either this example or this
example.
For NB classifiers, this should be handled implicitly by Bayes
rule, however in practice you may see some poor performances.
For you second question it's up for discussion, personally I break my data into a training and test split, perform cross validation on the training set for parameter estimation, retrain on all the training data and then test on my test set. However the amount of data you have may influence the way you split your data (more data means more options).
You could probably use Random Forest for your classification problem. There are basically 3 parameters to deal with data imbalance. Class Weight, Samplesize and Cutoff.
Class Weight-The higher the weight a class is given, the more its error rate is decreased.
Samplesize- Oversample the minority class to improve class imbalance while sampling the defects for each tree[not sure if Sci-kit supports this, used to be param in R)
Cutoff- If >x% trees vote for the minority class, classify it as minority class. By default x is 1/2 in Random forest for 2-class problem. You can set it to a lower value for the minority class.
Check out balancing predict error at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
For the 2nd question if you are using Random Forest, you do not need to keep separate train/validation/test set. Random Forest does not choose any parameters based on a validation set, so validation set is un-necessary.
Also during the training of Random Forest, the data for training each individual tree is obtained by sampling by replacement from the training data, thus each training sample is not used for roughly 1/3 of the trees. We can use the votes of these 1/3 trees to predict the out of box probability of the Random forest classification. Thus with OOB accuracy you just need a training set, and not validation or test data to predict performance on unseen data. Check Out of Bag error at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm for further study.

Support vector machines for mutliple object categorization

I am trying to use linear SVMs for multi-class object category recognition. So far what I have understood is that there are mainly two approaches used- one-vs-all(OVA) and one-vs-one(OVO).
But I am having difficulty understanding its implementation. I mean the steps that I think is used are:
First the feature descriptors are prepared from let's say SIFT. So I have a 128XN feature vector.
Next to prepare a SVM classifier model for a particluar object category(say car), I take 50 images of car as the positive training set and total 50 images of rest categories taking randomly from each category (Is this part correct?). I prepare such models for all such categories (say 5 of them).
Next when I have an input image, do I need to input the image into all the 5 models and then check their values (+1/-1) for each of these models? I am having difficulty understanding this part.
In one-vs-all approach, you have to check for all 5 models. Then you can take the decision with the most confidence value. LIBSVM gives probability estimates.
In one-vs-one approach, you can take the majority. For example, you test 1 vs. 2, 1 vs. 3, 1 vs. 4 and 1 vs. 5. You classify it as 1 in 3 cases. You do the same for other 4 classes. Suppose for other four classes the values are [0, 1, 1, 2]. Therefore, class 1 was obtained most number of times, making that class as the final class. In this case, you could also do total of probability estimates. Take the maximum. That would work unless in one pair the classification goes extremely wrong. For example, in 1 vs. 4, it classifies 4 (true class is 1) with a confidence 0.7. Then just because of this one decision, your total of probability estimates may shoot up and give wrong results. This issue can be examined experimentally.
LIBSVM uses one vs. one. You can check the reasoning here. You can read this paper too where they defend one vs. all classification approach and conclude that it is not necessarily worse than one vs. one.
In short, your positive training samples are always the same. In one vs one you train n classifiers with negative samples from each of the negative classes taken separately. In one vs all you lump all negative samples together and train a single classifier.. The problem with the former approach is that you have to consider all n outcomes to decide on the class. The problem with the latter approach is that lumping al negativel object classes create may create a non homogeneous class that is hard to process and analyse.

data imbalance in SVM using libSVM

How should I set my gamma and Cost parameters in libSVM when I am using an imbalanced dataset that consists of 75% 'true' labels and 25% 'false' labels? I'm getting a constant error of having all the predicted labels set on 'True' due to the data imbalance.
If the issue isn't with libSVM, but with my dataset, how should I handle this imbalance from a Theoretical Machine Learning standpoint? *The number of features I'm using is between 4-10 and I have a small set of 250 data points.
Classes imbalance has nothing to do with selection of C and gamma, to deal with this issue you should use the class weighting scheme which is avaliable in for example scikit-learn package (built on libsvm)
Selection of best C and gamma is performed using grid search with cross validation. You should try vast range of values here, for C it is reasonable to choose values between 1 and 10^15 while a simple and good heuristic of gamma range values is to compute pairwise distances between all your data points and select gamma according to the percentiles of this distribution - think about putting in each point a gaussian distribution with variance equal to 1/gamma - if you select such gamma that this distribution overlaps will many points you will get very "smooth" model, while using small variance leads to the overfitting.
Imbalanced data sets can be tackled in various ways. Class balance has no effect on kernel parameters such as gamma for the RBF kernel.
The two most popular approaches are:
Use different misclassification penalties per class, this basically means changing C. Typically the smallest class gets weighed higher, a common approach is npos * wpos = nneg * wneg. LIBSVM allows you to do this using its -wX flags.
Subsample the overrepresented class to obtain an equal amount of positives and negatives and proceed with training as you traditionally would for a balanced set. Take note that you basically ignore a large chunk of data this way, which is intuitively a bad idea.
I know this has been asked some time ago, but I would like to answer it since you might find my answer useful.
As others have mentioned, you might want to consider using different weights for the minority classes or using different misclassification penalties. However, there is a more clever way of dealing with the imbalanced datasets.
You can use the SMOTE (Synthetic Minority Over-sampling Technique) algorithm to generate synthesized data for the minority class. It is a simple algorithm that can deal with some imbalance datasets pretty well.
In each iteration of the algorithm, SMOTE considers two random instances of the minority class and add an artificial example of the same class somewhere in between. The algorithm keeps injecting the dataset with the samples until the two classes become balanced or some other criteria(e.g. add certain number of examples). Below you can find a picture describing what the algorithm does for a simple dataset in 2D feature space.
Associating weight with the minority class is a special case of this algorithm. When you associate weight $w_i$ with instance i, you are basically adding the extra $w_i - 1$ instances on top of the instance i!
What you need to do is to augment your initial dataset with the samples created by this algorithm, and train the SVM with this new dataset. You can also find many implementation online in different languages like Python and Matlab.
There have been other extensions of this algorithm, I can point you to more materials if you want.
To test the classifier you need to split the dataset into test and train, add synthetic instances to the train set (DO NOT ADD ANY TO THE TEST SET), train the model on the train set, and finally test it on the test set. If you consider the generated instances when you are testing you will end up with a biased(and ridiculously higher) accuracy and recall.

Resources