Machine learning testing data - machine-learning

I am new to machine learning and it might be a bit of a stupid question.
I have implemented my model and its working. I have a question about running it on testing data. It's a binary classification problem. If I know the proportions of classes in test data how could I use it to improve my model or improve predictions made by the model?
So let's say 75% belong to class 1 and 25% to class 0 of the testing data.
Any help is greatly appreciated
Thanks

Well, the first thing first is that your data should be balanced. And often in machine learning problem paradigm test data is treated as something that you know nothing about.
Any kind of information regarding improving your model by using some held out dataset is done by validation dataset.
Look for Validation Dataset. Why you need Validation Dataset, Balancing of the dataset. These terms will help you to proceed further.

There has been two different approaches to addressing imbalanced data: algorithm-level and data-level approach.
Algorithm approach: As mentioned above, ML algorithms penalize False Positives and False Negatives equally. A way to counter that is to modify the algorithm itself to boost predictive performance on minority class. This can be executed through either recognition-based learning or cost-sensitive learning. Feel free to check Drummond & Holte (2003); Elkan (2001); and Manevitz & Yousef (2001) in case you want to learn more about the topic.
Data approach: This consists of re-sampling the data in order to mitigate the effect caused by class imbalance. The data approach has gained popular acceptance among practitioners as it is more flexible and allows for the use of latest algorithms. The two most common techniques are over-sampling and under-sampling.
Over-sampling increases the number of minority class members in the training set. The advantage of over-sampling is that no information from the original training set is lost, as all observations from the minority and majority classes are kept. On the other hand, it is prone to overfitting.
Under-sampling, on contrary to over-sampling, aims to reduce the number of majority samples to balance the class distribution. Since it is removing observations from the original data set, it might discard useful information.
for more reference visit: https://medium.com/james-blogs/handling-imbalanced-data-in-classification-problems-7de598c1059f

Related

Destribution of classes in training set

When making a predictive model (specificly in telecommunication regarding churn), is it essential to have a 1:1 split between the classes in the training set(the actual distribution is more like 1:50)? When reading on what other people have done this seems to be the case. But they dont neccesarily state it as a requirement. What is recommended?
Your problem is frequently referred to as "Class Imbalance". Whether and how it will impact your result depends on the algorithm and the evaluation metric you use. The logistic regression algorithm, and the model accuracy, for example, can be very susceptible to this problem. Simple envelope models, and the model AUC, on the other hand, are more resilient against class imbalance. I am aware of five broad possible approaches to deal with this:
1) Up-sampling: Basically artificially increase the number of the rare class. This may be the go-to solution when you have very little data but you are confident that it is quite representative of the wider population.
2) Down-sampling: Just leave out a part of the abundant class. This is an option when you have a very large quantity of data.
3) Weighting: Telling your algorithm to give more importance to the information obtained from the rare class.
4) Bagging: Here, you are randomly sub-sampling your data and fitting "weak" learners to each subsample. Later, these weak learners are aggregated to create one final prediction.
5) Boosting: Similar to bagging, but each "weak" learner is not agnostic to the previously fitted one. Instead, they take the residuals from the latest ensemble.
There is a really nice article here that goes through these in great detail, including some worked examples in R, and another one here which focuses more on python

Is it considered overfit a decision tree with a perfect attribute?

I have a 6-dimensional training dataset where there is a perfect numeric attribute which separates all the training examples this way: if TIME<200 then the example belongs to class1, if TIME>=200 then example belongs to class2. J48 creates a tree with only 1 level and this attribute as the only node.
However, the test dataset does not follow this hypothesis and all the examples are missclassified. I'm having trouble figuring out whether this case is considered overfitting or not. I would say it is not as the dataset is that simple, but as far as I understood the definition of overfit, it implies a high fitting to the training data, and this I what I have. Any help?
However, the test dataset does not follow this hypothesis and all the examples are missclassified. I'm having trouble figuring out whether this case is considered overfitting or not. I would say it is not as the dataset is that simple, but as far as I understood the definition of overfit, it implies a high fitting to the training data, and this I what I have. Any help?
Usually great training score and bad testing means overfitting. But this assumes IID of the data, and you are clearly violating this assumption - your training data is completely different from the testing one (there is a clear rule for the training data which has no meaning for testing one). In other words - your train/test split is incorrect, or your whole problem does not follow basic assumptions of where to use statistical ml. Of course we often fit models without valid assumptions about the data, in your case - the most natural approach is to drop a feature which violates the assumption the most - the one used to construct the node. This kind of "expert decisions" should be done prior to building any classifier, you have to think about "what is different in test scenario as compared to training one" and remove things that show this difference - otherwise you have heavy skew in your data collection, thus statistical methods will fail.
Yes, it is an overfit. The first rule in creating a training set is to make it look as much like any other set as possible. Your training set is clearly different than any other. It has the answer embedded within it while your test set doesn't. Any learning algorithm will likely find the correlation to the answer and use it and, just like the J48 algorithm, will regard the other variables as noise. The software equivalent of Clever Hans.
You can overcome this by either removing the variable or by training on a set drawn randomly from the entire available set. However, since you know that there is a subset with an embedded major hint, you should remove the hint.
You're lucky. At times these hints can be quite subtle which you won't discover until you start applying the model to future data.

Deep learning Training dataset with Caffe

I am a deep-learning newbie and working on creating a vehicle classifier for images using Caffe and have a 3-part question:
Are there any best practices in organizing classes for training a
CNN? i.e. number of classes and number of samples for each class?
For example, would I be better off this way:
(a) Vehicles - Car-Sedans/Car-Hatchback/Car-SUV/Truck-18-wheeler/.... (note this could mean several thousand classes), or
(b) have a higher level
model that classifies between car/truck/2-wheeler and so on...
and if car type then query the Car Model to get the car type
(sedan/hatchback etc)
How many training images per class is a typical best practice? I know there are several other variables that affect the accuracy of
the CNN, but what rough number is good to shoot for in each class?
Should it be a function of the number of classes in the model? For
example, if I have many classes in my model, should I provide more
samples per class?
How do we ensure we are not overfitting to class? Is there way to measure heterogeneity in training samples for a class?
Thanks in advance.
Well, the first choice that you mentioned corresponds to a very challenging task in computer vision community: fine-grained image classification, where you want to classify the subordinates of a base class, say Car! To get more info on this, you may see this paper.
According to the literature on image classification, classifying the high-level classes such as car/trucks would be much simpler for CNNs to learn since there may exist more discriminative features. I suggest to follow the second approach, that is classifying all types of cars vs. truck and so on.
Number of training samples is mainly proportional to the number of parameters, that is if you want to train a shallow model, much less samples are required. That also depends on your decision to fine-tune a pre-trained model or train a network from scratch. When sufficient samples are not available, you have to fine-tune a model on your task.
Wrestling with over-fitting has been always a problematic issue in machine learning and even CNNs are not free of them. Within the literature, some practical suggestions have been introduced to reduce the occurrence of over-fitting such as dropout layers and data-augmentation procedures.
May not included in your questions, but it seems that you should follow the fine-tuning procedure, that is initializing the network with pre-computed weights of a model on another task (say ILSVRC 201X) and adapt the weights according to your new task. This procedure is known as transfer learning (and sometimes domain adaptation) in community.

How to choose feature selection method? By data or some rules?

I have been using some feature selection methods individually, e.g.RFE OR Select K best, for multi-label classification. Is there a technique or method can be used into choosing a feature selection method dynamically? for instance, according to the statistics of test data or some rule-based approach?
This probably isn't the answer you're looking for, but you could try each one and cross validate it against some test data. It should be fairly trivial to script this.
I don't know of any better way of picking a feature selection algorithm than this, but it can bias you towards the test data you've used.
These answers may help.
My assumption about the feature statistics is: maximal distances between means of values between the classes and minimal variance of values for one class classify a good feature.
I start with small learning set, test this assumption and increase the learning set if results look promising.
The final optimization is the histogram of means comparison. Features with similar histograms are removed. Those are redundant features which decrease (at least on SVM) the accuracy considerable (5-10%).
With this approach I gain 95% of accuracy on my data-set of 5 classes, 600 instances. The training takes < 1h. Manual training used to gained 98% with many days of experimenting.

Document Classification using Naive Bayes classifier

I am making a document classifier in mahout using the simple naive bayes algorithm. Currently, 98% of the data(documents) I have is of Class A and only 2% is of class B. My question is, since there is such a wide gap in the percentage of Class A docs vs Class B docs, would the classifier be able to train accurately still?
What I'm thinking of doing is ignoring a whole bunch of Class A documents and "manipulating" the dataset I have so that there isn't such a wide gap in the composition of the documents. Thus, the dataset I'll end up having will consist 30% of Class B and 70% of Class A. But, are there any repercussions of doing that I am not aware of?
A lot of this gets into how good "accuracy" is as a measure of performance, and that depends on your problem. If misclassifying "A" as "B" is just as bad/ok as misclassifying "B" as "A", then there is little reason to do anything other than just mark everything as "A", since you know it will reliably get you a 98% accuracy (so long as that unbalanced distribution is representative of the true distribution).
Without knowing your problem (and if accuracy is the measure you should use), the best answer I could give is "it depends on the data set". It is possible that you could get past 99% accuracy with standard naive bays, though it may be unlikely. For Naive Bayes in particular, one thing you could do is to disable the use of priors (the prior is essentially the proportion of each class). This has the effect of pretending that every class is equally likely to occur, though the model parameters will have been learned from uneven amounts of data.
Your proposed solution is a common practice, it sometimes works well. Another practice is to create fake data for the smaller class (how would depend on your data, for text documents I'm not aware of any particularly good way). Another practice is to increase the weights of the data points in the under-represented classes.
You can search for "imbalanced classification" and find a lot more information about these types of problems (they are one of the harder ones).
If accuracy is not actually a good measure for your problem, you can search for more information about "cost sensitive classification" which should be helpful.
You should not necessarily sample dataset A to reduce its instances. Several methods are available for efficient learning from imbalanced datasets, such as Majority Undersampling (exactly what you did), Minority Oversampling, SMOTE, and etc. Here is an empirical comparison of these methods: http://machinelearning.org/proceedings/icml2007/papers/62.pdf
Alternatively, you may define a custom cost matrix for the classifier. In other words, assuming B=Positive class, you may define cost(False Positive) < cost(False Negative). In this case, the classifier's output will bias towards the positive class. Here is a very helpful tutorial: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.164.4418&rep=rep1&type=pdf

Resources