C4.5 Decision Tree Algorithm doesn't improve the accuracy - machine-learning

I ran the C4.5 pruning algorithm in Weka using 10-fold cross-validation. I noticed that the unpruned tree had a higher testing accuracy than the pruned tree. Why didn't pruning the tree improve the testing accuracy?

Pruning reduces the size of the decision tree, which in general lowers training accuracy but often improves accuracy on test (unseen) data. It helps to mitigate overfitting, where the tree achieves near-perfect accuracy on the training data but fails on unseen data.
So pruning is expected to improve testing accuracy when the unpruned tree overfits, but it is not guaranteed to: if the unpruned tree is not overfitting much, pruning can remove useful splits and lower test accuracy slightly. From your question, it's difficult to say which case applies.
However, you can check your training accuracy and see whether pruning reduces it. If it does not, pruning is barely changing the tree and the problem is somewhere else; you probably then need to look at the number of features or the dataset size.
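The check suggested above can be sketched roughly as follows. Note the assumptions: this uses scikit-learn rather than Weka, and scikit-learn's DecisionTreeClassifier is a CART-style tree with cost-complexity pruning (ccp_alpha), not C4.5/J48. The synthetic dataset and the value ccp_alpha=0.01 are arbitrary choices for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y) so a fully grown tree can overfit.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)

def summarize(ccp_alpha):
    """Mean train accuracy, mean test accuracy (10-fold CV) and tree size."""
    tree = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=0)
    scores = cross_validate(tree, X, y, cv=10, return_train_score=True)
    # cross_validate works on clones, so fit once more to inspect the tree size.
    n_nodes = tree.fit(X, y).tree_.node_count
    return scores["train_score"].mean(), scores["test_score"].mean(), n_nodes

unpruned = summarize(ccp_alpha=0.0)   # no pruning
pruned = summarize(ccp_alpha=0.01)    # assumed pruning strength

for name, (train_acc, test_acc, n_nodes) in [("unpruned", unpruned), ("pruned", pruned)]:
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} nodes={n_nodes}")
```

If the training accuracy and node count barely move between the two rows, pruning is not doing much on that data. In Weka, the analogous comparison would be between J48 with and without its pruning options.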

Related

What is meant by stability in relation to neural networks

I hear the terms stability/instability thrown around a lot when reading up on Deep Q Networks. I understand that stability is improved with the addition of a target network and replay buffer, but I fail to understand exactly what it's referring to.
What would the loss graph look like for an unstable vs. a stable neural network?
What does it mean when a neural network converges/diverges?
Stability, also known as algorithmic stability, is a notion in
computational learning theory of how a machine learning algorithm is
perturbed by small changes to its inputs. A stable learning algorithm
is one for which the prediction does not change much when the training
data is modified slightly.
Here, stability means the following: suppose you train the model on 1000 training examples and it performs well. If the model is stable, training the same model on 900 of those examples should still give a model that performs well. That is why it is also called algorithmic stability.
As for the loss graph: if the model is stable, the loss curve should look much the same for both training set sizes (1000 and 900); for an unstable model the curves would differ noticeably.
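The 1000 vs. 900 comparison can be tried on a toy problem. The sketch below is only an illustration under assumed settings (synthetic data from y = 2x + 1 plus noise, a least-squares line, and a 1-nearest-neighbour predictor used purely as a contrast that reacts more strongly to removed points); it measures how far predictions move when 100 training examples are dropped.

```python
import random

def fit_line(points):
    """Ordinary least squares for y = a * x + b on a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = cov_xy / var_x
    return lambda x: a * x + (mean_y - a * mean_x)

def fit_nearest_neighbour(points):
    """1-nearest-neighbour regressor: predict the y of the closest training x."""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

def max_prediction_shift(fit, full_data, reduced_data, queries):
    """Largest change in prediction when the model is refit on the reduced data."""
    model_full, model_reduced = fit(full_data), fit(reduced_data)
    return max(abs(model_full(q) - model_reduced(q)) for q in queries)

rng = random.Random(0)
full = []
for _ in range(1000):
    x = rng.uniform(0, 10)
    full.append((x, 2 * x + 1 + rng.gauss(0, 1)))  # noisy linear target
reduced = rng.sample(full, 900)                     # drop 100 examples
queries = [i * 0.05 for i in range(201)]            # evaluation grid on [0, 10]

line_shift = max_prediction_shift(fit_line, full, reduced, queries)
knn_shift = max_prediction_shift(fit_nearest_neighbour, full, reduced, queries)
print(f"least-squares line: {line_shift:.4f}, 1-nearest-neighbour: {knn_shift:.4f}")
```

A small shift means the predictions hardly depend on which 900 examples were kept, which is the behaviour described above as stable.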
In machine learning we want to minimize the loss, so when we say a model converges we mean that its loss has settled within an acceptable margin and that additional training would no longer improve the model.
When training diverges, the opposite happens: the loss grows or oscillates instead of settling. This should not be confused with a statistical divergence (for example the Kullback-Leibler divergence), which is a non-symmetric measure of the difference between two probability distributions, used where a symmetric metric such as a distance is not appropriate.
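For the training sense of converging and diverging, a toy example may help. This is a minimal sketch, not a DQN: plain gradient descent on f(w) = w**2, where the two learning rates (0.1 and 1.1) are assumptions chosen so that one run settles and the other blows up.

```python
def gradient_descent_losses(learning_rate, steps=100, w=5.0):
    """Minimise f(w) = w**2 and record the loss after every update."""
    losses = []
    for _ in range(steps):
        w -= learning_rate * 2 * w  # the gradient of w**2 is 2 * w
        losses.append(w ** 2)
    return losses

converging = gradient_descent_losses(learning_rate=0.1)  # each step scales w by 0.8
diverging = gradient_descent_losses(learning_rate=1.1)   # each step scales w by -1.2

print("converging run, last loss:", converging[-1])
print("diverging run, last loss:", diverging[-1])
```

Plotted against the step number, the first list flattens out near zero while the second keeps growing, which is roughly what a converging and a diverging loss graph look like.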

The impact of number of negative samples used in a highly imbalanced dataset (XGBoost)

I am trying to model a classifier using XGBoost on a highly imbalanced data-set, with a limited number of positive samples and practically infinite number of negative samples.
Is it possible that having too many negative samples (making the data-set even more imbalanced) will weaken the model's predictive power? Is there a reason to limit the number of negative samples aside from running time?
I am aware of the scale_pos_weight parameter which should address the issue but my intuition says even this method has its limits.
To answer your question directly: adding more negative examples will likely decrease the decision power of the trained classifier. For the negative class, choose the most representative examples and discard the rest.
Learning from an imbalanced dataset can hurt predictive power and even the ability of a classifier to converge at all. The generally recommended strategy is to keep a similar number of training examples for each class. How much class imbalance affects learning depends on the shape of the decision space and the width of the boundaries between classes: the wider the boundaries and the simpler the decision space, the more successful training is, even on imbalanced datasets.
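As a rough illustration of limiting the negatives, here is a minimal sketch using plain random undersampling. That is a simpler stand-in for picking the "most representative" negatives, and the 10:1 ratio and the helper name undersample_negatives are assumptions for the example. The last lines compute the ratio of negatives to positives, which is the typical starting value the XGBoost documentation suggests for scale_pos_weight.

```python
import random

def undersample_negatives(labels, negatives_per_positive, seed=0):
    """Return sorted indices keeping all positives and a random subset of negatives."""
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    keep = min(len(negatives), negatives_per_positive * len(positives))
    sampled = random.Random(seed).sample(negatives, keep)
    return sorted(positives + sampled)

# Toy label vector: 20 positives and 2000 negatives.
labels = [1] * 20 + [0] * 2000
kept = undersample_negatives(labels, negatives_per_positive=10)

kept_labels = [labels[i] for i in kept]
n_pos = sum(kept_labels)
n_neg = len(kept_labels) - n_pos

# Ratio of negatives to positives in the data that will actually be used for training.
scale_pos_weight = n_neg / n_pos
print(n_pos, n_neg, scale_pos_weight)
```

The kept indices would then be used to select rows of the feature matrix before training; whether 10:1 or some other ratio works best is something to validate on held-out data.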
TL;DR
For a quick overview of the methods of imbalanced learning I recommend these articles:
SMOTE and AdaSyn by example
How to Handle Imbalanced Data: An Overview
Dealing with Imbalanced Classes in Machine Learning
Learning from Imbalanced Data by Prof. Haibo He (more scientific)
There is a Python package called imbalanced-learn with extensive documentation of algorithms, which I recommend for an in-depth review.

Does it overfit if the nested models are trained on the same data

Does it overfit if I build a machine learning model that uses the output of another machine learning model while both models are trained on the same data?
Basically, I was wondering if I can use the KNN prediction result as an input for a deep neural network model while both models are trained on the very same data.
Nesting machine learning models is possible. For example, neural networks can be seen as multiple nested perceptrons (see https://en.wikipedia.org/wiki/Perceptron).
However, you are right: nesting machine learning models increases the VC-dimension (https://en.wikipedia.org/wiki/VC_dimension) of the complete machine learning system and thus the risk of overfitting.
In practice, cross-validation is often used to reduce the risk of overfitting.
Edit:
@MatiasValdenegro +1 for pointing out something I did not state clearly in my answer. Pure cross-validation can indeed only be used to detect overfitting.
However, when training certain machine learning systems such as neural networks, it is possible to use some sort of cross-validation (a held-out validation set, commonly called early stopping) to reduce the risk of overfitting. To do so, we hold out e.g. 10% of the training data and do not train on it. After each training round, the trained system is evaluated on the held-out data. Once the network starts getting worse on the held-out part, the training algorithm stops. This is done, for example, by the Python library pybrain (http://pybrain.org/).
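The stopping rule described above can be sketched in a few lines. This is a generic illustration, not pybrain's implementation: train_one_epoch and validation_loss are hypothetical callables, the patience of 3 rounds is an assumed setting, and the "model" is a toy whose held-out loss is made U-shaped on purpose.

```python
def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=100, patience=3):
    """Stop once the held-out loss has not improved for `patience` rounds."""
    best_loss = float("inf")
    best_epoch = -1
    rounds_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        loss = validation_loss()  # evaluated on the held-out part only
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                break
    return best_epoch, best_loss

# Toy stand-in for a real model: the held-out loss falls until the 10th
# training round and rises afterwards, as it would once overfitting sets in.
state = {"epochs_trained": 0}

def train_one_epoch():
    state["epochs_trained"] += 1

def validation_loss():
    return (state["epochs_trained"] - 10) ** 2 + 1

best_epoch, best_loss = train_with_early_stopping(train_one_epoch, validation_loss)
print(best_epoch, best_loss, state["epochs_trained"])
```

A real training loop would usually also keep a copy of the weights from the best round and restore them after stopping.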

Machine learning : RandomForest data pre-processing

Before fitting a RandomForest what should be done with continuous features, should they be standard scaled?
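No. Decision trees, and Random Forests for that matter, don't really care whether they are dealing with continuous or discrete data, nor about the scale of a feature: each split only compares a feature against a threshold, so a monotonic rescaling such as standardization leaves the resulting partition of the samples unchanged. So even if you don't standardize, it won't be an issue.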
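To see why scaling does not matter for threshold splits, here is a small sketch of a single split search (a decision stump chosen by Gini impurity, written from scratch for illustration rather than taken from any Random Forest library). The toy values and labels are made up; the point is that the same samples end up on the left side before and after standardization.

```python
from statistics import mean, pstdev

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split_left_side(values, labels):
    """Indices sent left by the threshold split with the lowest weighted Gini impurity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_left = float("inf"), None
    for cut in range(1, len(order)):
        if values[order[cut - 1]] == values[order[cut]]:
            continue  # no threshold can separate equal values
        left, right = order[:cut], order[cut:]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(order)
        if score < best_score:
            best_score, best_left = score, set(left)
    return best_left

values = [12.0, 1.0, 10.0, 3.0, 2.0, 11.0]   # one continuous feature
labels = [1, 0, 1, 0, 0, 1]

mu, sigma = mean(values), pstdev(values)
scaled = [(v - mu) / sigma for v in values]   # standard scaling

raw_left = best_split_left_side(values, labels)
scaled_left = best_split_left_side(scaled, labels)
print(raw_left == scaled_left)
```

The threshold value itself changes with the scaling, but the ordering of the samples and therefore the chosen partition does not.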

Testing an image processing algorithm on noisy data

I wrote an image processing program that trains a classifier to recognize an object in an image. Now I want to test how my algorithm responds to noise. I would like the algorithm to have some robustness to noise.
My question is: should I train the classifier on a noisy version of the training dataset, or train it on the original dataset and measure its performance on noisy data?
Thank you.
To show the robustness of a classifier, one can run highly noisy test data through the originally trained classifier. Depending on that performance, one can retrain with noisy data and then test again. For application development, if including extremely noisy samples increases accuracy, then that's the way to go. The literature recommends as wide a range of training samples as possible; however, this sometimes degrades performance in specific cases.
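The first step (testing the originally trained classifier on increasingly noisy data) might look like the sketch below. Everything here is an assumption for illustration: images are flat lists of pixel values in 0-255, the noise is zero-mean Gaussian, and classify is a hypothetical stand-in for the trained classifier.

```python
import random

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to each pixel and clip to the 0-255 range."""
    return [min(255.0, max(0.0, p + rng.gauss(0, sigma))) for p in image]

def accuracy_under_noise(classify, images, labels, sigmas, seed=0):
    """Accuracy of an already trained classifier at each noise level."""
    rng = random.Random(seed)
    results = {}
    for sigma in sigmas:
        noisy = [add_gaussian_noise(img, sigma, rng) for img in images]
        correct = sum(classify(img) == y for img, y in zip(noisy, labels))
        results[sigma] = correct / len(images)
    return results

# Hypothetical stand-in for a trained classifier: bright images are class 1.
def classify(image):
    return int(sum(image) / len(image) > 127)

# Toy test set: 20 dark and 20 bright 16-pixel "images".
images = [[50.0] * 16 for _ in range(20)] + [[200.0] * 16 for _ in range(20)]
labels = [0] * 20 + [1] * 20

report = accuracy_under_noise(classify, images, labels, sigmas=[0, 10, 50, 100])
for sigma, acc in report.items():
    print(f"sigma={sigma}: accuracy={acc:.2f}")
```

If accuracy drops off quickly as sigma grows, that is the point at which retraining with noisy samples and repeating the same test becomes worth trying.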
