How to get indices of created samples in Imblearn - machine-learning

I am using different imblearn over-sampling methods on a dataset which contains ~55,800 samples. About 200 are class 1, the rest class 0. I am oversampling class 1 with various over-sampling strategies.
It does not improve my model quality, so I want to take a closer look at the generated samples. But how do I access them? Is there any way to get the indices of the created ones?
Looping through the samples list before and after sampling, filtering out the non-duplicates, is way too demanding and freezes my laptop.

There is no built-in function in imblearn to return the indices of the oversampled points, as far as I know. So the only solution is to get the indices by comparing the data before and after, as you suggested. To avoid freezing your laptop, you can discard most of the majority-class samples, since they are not used to create the oversampled samples of the minority class (at least not for random oversampling or plain SMOTE).
So let's say you delete all except 500 samples of class 0, keep all 200 samples of class 1, perform the SMOTE oversampling, and then compare as you tried before. With this number of samples it shouldn't freeze your laptop, and you can get an idea of what the oversampled samples look like.
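For illustration, here is a minimal sketch of that workflow, assuming your features and labels are numpy arrays called X and y (those names, the 500/200 counts, and the random_state are placeholders):

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Keep all ~200 minority samples and only 500 majority samples,
# so the before/after comparison stays cheap.
minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0][:500]
keep = np.concatenate([minority_idx, majority_idx])
X_small, y_small = X[keep], y[keep]

X_res, y_res = SMOTE(random_state=42).fit_resample(X_small, y_small)

# Rows of the resampled data that were not in the reduced input
# are the synthetic minority samples.
original_rows = {tuple(row) for row in X_small}
synthetic = np.array([row for row in X_res if tuple(row) not in original_rows])
print(synthetic.shape)
```

Comparing a few hundred rows this way is fast, and you can then plot `synthetic` against the original minority samples to inspect them.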

Related

What is the appropriate way to train a face detector / classifier?

I want to build a face detector/classifier to generate a network that detects whether a face is present in an image/video.
I understand the basic concept, but what I have problems with is the choice of the number of classes.
Initially, I thought that two classes (with face / without face) would be sufficient. However, I was unsure which data I should use for the class 'without face'. So I threw together datasets of equipment, plants and animals, whereupon the classes became very unbalanced, which is apparently not good.
Then I thought it would be better to use as many classes as possible.
But again, I am unsure what the best/common approach to the problem would be.
You can experiment with any number of samples and different images for the negative class. If the equipment/plants/animals datasets you have are imbalanced, you can try to subsample, e.g. pick 100 images from each.
Just don't make the negative class too large relative to the number of face images you have. The rest is up to experimentation.
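As a rough sketch of that subsampling idea (the folder names and the 100-per-source count are just placeholders):

```python
import random
from pathlib import Path

# Hypothetical folders of negative images (equipment, plants, animals, ...).
negative_sources = ["data/equipment", "data/plants", "data/animals"]
per_source = 100  # the same number from each source keeps the negatives mixed evenly

negatives = []
for folder in negative_sources:
    images = sorted(Path(folder).glob("*.jpg"))
    negatives += random.sample(images, min(per_source, len(images)))

# 'negatives' is the "without face" class; keep its total size
# comparable to the number of face images you have.
print(len(negatives))
```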

Classifying pattern in time series

I am dealing with a repeating pattern in time series data. My goal is to classify every pattern as 1, and anything that does not follow the pattern as 0. The pattern repeats itself between every two peaks as shown below in the image.
The patterns are not necessarily fixed in sample size but stay within an approximate sample size, let's say 500 samples ±10%. The heights of the peaks can change. The random signal (I called it random, but basically it means not following the pattern shape) can also change in value.
The data is from a sensor. Patterns are when the device is working smoothly. If the device is malfunctioning, then I will not see the patterns and will get something similar to the class 0 I have shown in the image.
What I have done so far is build a logistic regression model. Here are my steps for data preparation:
1. Grab the data between every two consecutive peaks, resample it to a fixed size of 100 samples, and scale it to [0, 1]. This is class 1.
2. Repeat step 1 on the data between valleys and call it class 0.
3. Generate some noise and repeat step 1 on chunks of 500 samples to build extra class 0 data.
The bottom figure shows my predictions on the test dataset. The prediction on the noise chunk is not great, and I am worried that on real data I may get even more false positives. Any idea how I can improve my predictions? Is there a better approach when there is no class 0 data available?
I have seen a similar question here. My understanding of Hidden Markov Models is limited, but I believe they are used to predict future data. My goal is to classify a sliding window of 500 samples throughout my data.
I have some proposals that you could try out.
First, I think recurrent neural networks (e.g. LSTMs) are often used in this field. But I have also heard that some people work with tree-based methods like LightGBM (I think Aileen Nielsen uses this approach).
So if you don't want to dive into neural networks, which is probably not necessary because your signals seem relatively easy to distinguish, you can give LightGBM (or other tree ensemble methods) a chance.
If you know the maximum length of a positive sample, you can define the length of your "sliding sample window", which becomes your input vector (each sample in the window becomes one input feature). I would then add an extra attribute with the number of samples since the last peak occurred (outside/before the sample window). Then you can decide by how many steps you let your window slide over the data; this also depends on the memory you have available.
But maybe it would be wise to skip some of the windows around a change between positive and negative, because those states might not be classifiable unambiguously.
In case memory becomes an issue, neural networks could be the better choice, because they do not need all the training data available at once, so you can generate your input data in batches. With tree-based methods this possibility does not exist, or only in a very limited way.
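Here is a rough sketch of the windowing idea with LightGBM; the signal, the labelled windows, the peak positions, the window length and the step size are all assumptions you would replace with your own data:

```python
import numpy as np
from lightgbm import LGBMClassifier  # assumes the lightgbm package is installed

WINDOW = 500  # maximum expected pattern length (placeholder)
STEP = 50     # how far the window slides each time (placeholder)

def make_windows(signal, peak_positions, window=WINDOW, step=STEP):
    """Turn a 1-D signal into fixed-length feature vectors.
    Each window's raw samples become features, plus one extra feature:
    the distance (in samples) from the last peak before the window."""
    peak_positions = np.asarray(peak_positions)
    X, starts = [], []
    for s in range(0, len(signal) - window, step):
        prior_peaks = peak_positions[peak_positions < s]
        dist_to_last_peak = s - prior_peaks[-1] if len(prior_peaks) else window
        X.append(np.append(signal[s:s + window], dist_to_last_peak))
        starts.append(s)
    return np.array(X), np.array(starts)

# With windows you have labelled (1 = pattern, 0 = not), training would look like:
# X_train, starts = make_windows(signal, peak_positions)
# clf = LGBMClassifier(n_estimators=200).fit(X_train, y_train)
# predictions = clf.predict(X_test)
```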
I'm not sure what you are trying to achieve.
If you want to characterize what is a peak and what is not, which is an after-the-fact classification, then you can use a simple rule to define peaks, such as signal(t) - average(signal, t-N to t) > T, with T a certain threshold and N the number of data points to look backwards over.
This would qualify what is a peak (class 1) and what is not (class 0), and hence gives a classification of the patterns.
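That rule is simple enough to write down directly; in this sketch the window length N and the threshold T are arbitrary placeholder values you would tune on your own signal:

```python
import numpy as np

def label_peaks(signal, N=50, T=1.0):
    """Label each time step as peak (1) or not (0) using the rule
    signal[t] - mean(signal[t-N:t]) > T."""
    labels = np.zeros(len(signal), dtype=int)
    for t in range(N, len(signal)):
        if signal[t] - signal[t - N:t].mean() > T:
            labels[t] = 1
    return labels
```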
If your goal is to predict, a few time units before the peak happens (at time t), that it is going to happen, using say the data from t-n1 to t-n2 as features, then logistic regression might not necessarily be the best choice.
To find the right model you have to start by visualizing the features you have from t-n1 to t-n2 for every peak(t) and see if there is any pattern you can find. And it can be anything:
was there a peak in the n3 days before t?
is there a trend?
was there an outlier (transform your data into exponential)?
In order to compare these patterns, think of normalizing them so that the n2-n1 data points go from 0 to 1, for example.
If you find a pattern visually then you will know what kind of model is likely to work, on which features.
If you don't, then it's likely that the white noise you added will do just as well, so you might not find a good prediction model.
However, your bottom graph is not so bad; you have only 2 major false positives out of >15 predictions. This suggests there is room for better feature engineering.

Sample Weight vs. Down Sampling for Handling Imbalanced Data

Suppose you have an imbalanced dataset. Without generating new data for it, how can you handle it efficiently? I know we can use sample weights or downsampling, but between these two I am not sure which to choose. Also, suppose you need to build a classification model on this imbalanced data: how will these two techniques influence the model's performance differently?
It totally depends on how many observations you have left after downsampling, and on how well the downsampled class still covers the variety of the original class.
For example, say class 1 consists of 100 observations and class 2 contains 2,000 observations (class 1 is ~5%). Then downsampling won't make sense, as there won't be enough observations left to effectively fit a model; 100 observations per class are very few, and the model would have high training error.
But if class 1 has 100,000 observations and class 2 has 2,000,000 (5% again), then it still makes sense to downsample, as you have enough observations to train the model.
So the answer totally depends on the type of data you have. I personally would go with SMOTE. Hope this helps.
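To make the two options concrete, here is a minimal sklearn sketch; the logistic regression model and the X/y arrays are placeholders, not something prescribed by the question:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical X (features) and y (labels) with a small minority class.

# Option 1: keep all the data and reweight errors on the minority class.
clf_weighted = LogisticRegression(class_weight='balanced', max_iter=1000)
# clf_weighted.fit(X, y)

# Option 2: downsample the majority class to the size of the minority class.
# minority = np.where(y == 1)[0]
# majority = np.random.choice(np.where(y == 0)[0], size=len(minority), replace=False)
# keep = np.concatenate([minority, majority])
# clf_down = LogisticRegression(max_iter=1000).fit(X[keep], y[keep])
```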

tackle class imbalance in single shot object detector

I am training an object detection model for multi-class objects in the image. The dataset is custom collected and labelled data with bounding boxes and class labels in the ground truth data.
I trained MobileNet+SSD, SqueezeDet and YoloV3 networks with this custom data but get poor results. The rationale for choosing these models is their speed and light weight (low memory footprint). Their single-shot detector approach has also been shown to perform well in the literature.
The class instance distribution in the dataset is as below
Class 1 -- 2469
Class 2 -- 5660
Class 3 -- 7614
Class 4 -- 13253
Class 5 -- 35262
Each image can have objects from any of the five classes. Class 4 and 5 have very high incidence.
The performance is very skewed, with high recall and Average Precision for classes 4 and 5, and an order of magnitude lower for the other 3 classes.
I have tried fine-tuning different filtering parameters, the NMS threshold, and model training parameters, to no avail.
Question:
How can I tackle such class imbalance to boost the detection Average Precision and object detection accuracy for all classes in object detection models?
Low precision means your model is suffering from false positives, so you can try hard negative mining: run your model, find the false positives, and include them in your training data. You can even try using only those false positives as negative examples.
As you would expect, another way can be collecting more data, if possible.
If that is not possible, you may consider adding synthetic data (e.g. change the image brightness, or the viewpoint (multiply by a transformation matrix so the image looks stretched)).
One last option may be having the same amount of data for each class, e.g. 5k per class.
PS: Keep in mind that the flexibility of your model has a great impact, so watch out for overfitting and underfitting.
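A framework-agnostic sketch of the hard-negative-mining step could look like this; the box format, the IoU threshold and the detector that produces `predicted_boxes` are all assumptions:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def hard_negatives(predicted_boxes, ground_truth_boxes, iou_thresh=0.5):
    """Predicted boxes that match no ground-truth box are the false
    positives to feed back into training as hard negatives."""
    return [p for p in predicted_boxes
            if all(iou(p, g) < iou_thresh for g in ground_truth_boxes)]
```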
When generating your synthetic data as mentioned by the previous author, do not apply illumination or viewpoint variations etc. to your whole dataset, but rather apply them randomly. The class instance counts are also way off, so it would be best to either limit the numbers or gather more data for the under-represented classes. You could also try applying class weights so that the loss compensates for the over-represented classes. You are making a lot of assumptions; simple experimentation could yield results that surprise you. Remember, deep learning is part science and a lot of art.
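As an illustration of the class-weight idea (how the weights are then fed into a particular detector's loss depends on the framework, so that part is left out), the 'balanced' heuristic from sklearn can be computed directly from the instance counts in the question:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Per-class instance counts from the question (classes 1..5).
counts = np.array([2469, 5660, 7614, 13253, 35262])
labels = np.repeat(np.arange(1, 6), counts)

# 'balanced' weights are inversely proportional to class frequency,
# so the rare classes are penalised more heavily when misclassified.
weights = compute_class_weight(class_weight='balanced',
                               classes=np.arange(1, 6), y=labels)
print(dict(zip(range(1, 6), np.round(weights, 3))))
```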

Does the ratio of two classes matter in classification problems?

I am building a sentiment analysis program using some tweets I have gathered. The labeled data goes through a neural network which classifies it into two classes, positive and negative.
The data is still being labeled. So far I have observed that the positive category has a very small number of observations.
The records for the positive category could be about 5% of the training data set (the same ratio probably holds in the population as well).
Would this create problems in the final "program"?
The size of the data set is about 5000 records.
Yes, yes it can. There are two things to consider:
5% of 5000 is 250, so you will try to model the data distribution of your class based on just 250 samples. This might be orders of magnitude too small for a neural network; you might need 40x more data to have a representative sample. While you can easily reduce the majority class through subsampling without a big risk of destroying its structure, there is no way to get "more structure" from fewer points (you can replicate points, add noise, etc., but this does not add structure, it just adds assumptions).
Class imbalance can also lead to convergence to naive solutions, like "always False", which has 95% accuracy. Here you can simply play around with the cost function to make it more robust to imbalance (in particular, the balanced train splits suggested by @PureW are essentially a "black box" way of changing the loss function so that it puts more weight on the minority class; when you have access to your classifier's loss, as in a NN, you should not do this, but instead change the cost function directly and keep all the data).
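As a sketch of the "change the cost function" route in Keras (the network architecture, the 300-dimensional input and the 1:19 weight ratio are placeholders chosen to mirror a 5%/95% split):

```python
import numpy as np
from tensorflow import keras  # assumes TensorFlow/Keras is installed

# Hypothetical data: 5000 tweets encoded as 300-dimensional feature vectors,
# with roughly 5% positive labels.
# X = np.random.rand(5000, 300); y = (np.random.rand(5000) < 0.05).astype(int)

model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(300,)),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# Errors on the rare positive class count ~19x more, roughly the
# inverse of the 5% / 95% split, while all the data is kept.
# model.fit(X, y, epochs=10, class_weight={0: 1.0, 1: 19.0})
```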
Without even splits of the different classes, you may want to introduce weights in your loss function such that errors in the smaller class are deemed more important.
Another solution, since 5000 samples may or may not be much data depending on your problem, could be to sample several datasets. You take this set of 5000 samples and sample data points from it so that you have a new dataset with an even split of the classes. This means the new dataset is only about 10% of your original dataset, but it is evenly split between the classes. You can redo this sampling many times and end up with several datasets, which is useful for bootstrap aggregating.
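A small sketch of that resampling scheme, assuming labels in a numpy array y with 1 as the rare positive class (the number of subsets and the seed are arbitrary):

```python
import numpy as np

def balanced_subsets(y, n_sets=10, seed=0):
    """Yield index sets, each containing all positives plus an
    equal-sized random draw of negatives (evenly split subsets)."""
    rng = np.random.default_rng(seed)
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    for _ in range(n_sets):
        sampled_neg = rng.choice(neg, size=len(pos), replace=False)
        yield np.concatenate([pos, sampled_neg])

# Train one model per index set and average their predictions (bagging).
```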
