Can you have TOO MUCH training data or not?
I am working on a system that updates its training data when a user reports a mistake it has made, so that it does not make the same mistake again (e.g. if the user looks a little different from their usual training images, it adds the new capture of them to the training data).
Will this decrease performance at all? Should there be a maximum? Would it be better just to have the same training set and just accept the fail rate instead of trying to improve it?
Cheers!
Depending on how different the user looks, this could be a problem.
Let's say the user is wearing sunglasses, looking the wrong way, and wearing a scarf.
This would occlude too much of the image to properly determine if this is a face or not.
Training on such images would provide horrendous results overall, because they are not something that qualifies as a face, or at least not according to the theories provided for eigenfaces.
If you want to keep training a model according to feedback, I think you should at least have a person check the images and decide if they are worth training.
But if you trained the model with a proper dataset to begin with, almost none of the feedback you receive would properly qualify as a face, because if it did, the model would not have failed in the first place.
Regarding a maximum: if I recall correctly, there is no hard limit you need to respect, but beyond a certain point the time needed to retrain the model becomes absurdly long, which could be unwanted in your specific situation.
I hope this made sense to you. If you have any more questions about my answer, just leave a comment.
Is it necessary to repeat similar template data... The meaning and context are the same, but the smaller details vary. If I remove these redundancies, the dataset is very small (in the hundreds), but if data like this is included, it easily crosses into the thousands. Which is the right approach?
SAMPLE DATA
This is actually not a question suited for Stack Overflow, but I'll answer anyway:
You have to think about how the emails (or whatever your data is) will look in real-life usage: do you want to detect any kind of spam, or just messages similar to what your sample data shows? If the former, your dataset is just not suited to the problem, since there are not enough varied data samples. When you think about it, all of the sentences are effectively the same, because the company name isn't really valuable information and will probably not be learned as a feature by your RNN. So the information is almost the same. And since every input sample runs through the network multiple times anyway (once each epoch), it doesn't really help to have almost the same sample multiple times.
So you shouldn't have one kind of almost identical data sample dominating your dataset.
But as I said: if you primarily want to filter out "Dear customer, we wish you a ..." you can try it with this dataset, but you wouldn't really need an RNN to detect that. If you want to detect all kinds of spam, you should search for a new dataset, since ~100 unique samples are not enough. I hope that was helpful!
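To see how many genuinely distinct samples a templated dataset contains, one option is to mask the parts that vary and count what is left. The sketch below is only an illustration: the company names and sample sentences are made up (the original sample data is not shown), and masking digits is an extra assumption.

```python
import re

# Hypothetical company names; the real sample data is not shown in the question.
COMPANY_NAMES = ["Acme", "Globex", "Initech"]

def normalize(text, names=COMPANY_NAMES):
    """Mask the parts that vary between templated samples, then lower-case."""
    masked = text
    for name in names:
        masked = re.sub(re.escape(name), "<COMPANY>", masked, flags=re.IGNORECASE)
    masked = re.sub(r"\d+", "<NUM>", masked)  # assumption: numbers carry no signal
    return " ".join(masked.lower().split())

def deduplicate(samples):
    """Keep the first sample seen for each distinct normalized template."""
    seen = set()
    unique = []
    for sample in samples:
        key = normalize(sample)
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique

samples = [
    "Dear customer, Acme wishes you a happy new year",
    "Dear customer, Globex wishes you a happy new year",
    "Dear customer, Initech wishes you a happy new year",
    "Your invoice 1042 is attached",
]
unique_samples = deduplicate(samples)
print(len(samples), "samples,", len(unique_samples), "distinct templates")
```

If the second number is in the low hundreds, the dataset really is small, however many near-copies it contains.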
I have this project I'm working on. A part of the project involves multiple test runs during which screenshots of an application window are taken. Now, we have to ensure that screenshots taken between consecutive runs match (barring some allowable changes). These changes could be things like filenames, dates, different logos, etc. within the application window that we're taking a screenshot of.
I had the bright idea to automate the process of doing this checking. Essentially my idea was this. If I could somehow mathematically quantify the difference between a screenshot from the N-1th run and the Nth run, I could create a binary labelled dataset that mapped feature vectors of some sort to a label (0 for pass or 1 for fail if the images do not adequately match up). The reason for all of this was so that my labelled data would help make the model understand what scale of changes are acceptable, because there are so many kinds that are acceptable.
Now let's say I have access to lots of data that I have meticulously labelled, in the thousands. So far I have tried using SIFT in OpenCV, using keypoint matching to determine a similarity score between images. But this isn't an intelligent, learning process. Is there some way I could take some information from SIFT and use it as my x-value in my dataset?
Here are my questions:
What information would I need as my x-value? It needs to be something that represents the difference between two images. So maybe the difference between feature vectors from SIFT? What do I do when those vectors are of slightly different dimensions?
Am I on the right track with thinking about using SIFT? Should I look elsewhere and if so where?
Thanks for your time!
The approach that is being suggested in the question goes like this -
Find SIFT features of two consecutive images.
Use those to somehow quantify the similarity between two images (sounds reasonable)
Use this metric to first classify the images into similar and non-similar.
Use this dataset to train a NN to do the same job.
I am not completely convinced if this is a good approach. Let's say that you created the initial classifier with SIFT features. You are then using this data to train a NN. But this data will definitely have a lot of wrong labels. Because if it didn't have a lot of wrong labels, what's stopping you from using your original SIFT based classifier as your final solution?
So if your SIFT based classification is good, why even train a NN? On the other hand, if it's bad, you are giving a lot of wrongly labeled data to the NN for training. I think the latter is probably a bad idea. I say probably because there is a possibility that the wrong labels just encourage the NN to generalize better, but that would require a lot of data, I imagine.
Another way to look at this is, let's say that your initial classifier is 90% accurate. That's probably the upper limit of the performance for the NN that you are looking at when talking about training it with this data.
You said that the issue you have with your first approach is that it isn't 'an intelligent, learning process'. I think it's wrong to assume that the former approach is always inferior to the latter. SIFT is a powerful tool that can solve a lot of problems without all the 'black-boxness' of an NN. If this problem can be solved with sufficient accuracy using SIFT, I think going after a learning based approach is not the way to go, because again, a learning based approach isn't necessarily superior.
However, if the SIFT approach isn't giving you good enough results, definitely start thinking of NN stuff, but at that point, using the "bad" method to label the data is probably a bad idea.
On a related note, I think you may be underestimating the amount of data that is needed for this. You mentioned data in the thousands, but honestly that's not a lot. You would need a lot more, I think.
Here is how I would approach this instead -
Do SIFT keypoint detection for a sample reference image.
Manually filter out keypoints that do not belong to the invariant parts of the image. That is, keep only keypoints at locations in the image that are guaranteed (or very likely) to always be present.
When you get a new image, compute the keypoints and do matching with the reference image.
Set some threshold of the ratio of good matches to the total number of matches.
Depending on your application, this might give you good enough results.
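The thresholding step above can be sketched as follows. This assumes the matching has already been done elsewhere (for example with a 2-nearest-neighbour descriptor matcher) and that a match counts as "good" when it passes Lowe's ratio test; that definition of "good", the function names, and the values 0.75 and 0.8 are all assumptions for illustration and would need tuning on your own screenshots.

```python
def good_match_ratio(match_distances, ratio=0.75):
    """Fraction of matches that pass Lowe's ratio test.

    match_distances: list of (best, second_best) descriptor distances,
    one pair per reference keypoint.
    """
    if not match_distances:
        return 0.0
    good = sum(1 for best, second in match_distances if best < ratio * second)
    return good / len(match_distances)

def screenshots_match(match_distances, threshold=0.8, ratio=0.75):
    """Pass if enough of the reference keypoints found a good match."""
    return good_match_ratio(match_distances, ratio) >= threshold

# Hypothetical distances for four reference keypoints.
distances = [(10.0, 100.0), (20.0, 90.0), (50.0, 55.0), (5.0, 80.0)]
print(good_match_ratio(distances))          # 3 of the 4 pairs pass the ratio test
print(screenshots_match(distances))         # fails at the 0.8 threshold
print(screenshots_match(distances, 0.7))    # passes at a looser threshold
```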
If not, and if you really want your solution to be NN based, I would say you need to manually label the dataset as opposed to using SIFT.
By “Cold Start” I mean that computer vision models for object detection or semantic segmentation often require about 5000 images per class. So suppose an idea is floated within the company, e.g. we want to use object detection to count the number of wooden logs on a truck when it is dispatched and then use the same app to count the number that is received.
So now the challenge is that you have only a few images of wooden logs on a truck, but to train any model you need thousands. What do practitioners typically do for these prototypes?
At this stage it is not clear which model to try, and it is not very feasible to ask the business to invest in collecting and labelling thousands of images of logs.
That is why I am calling this “Cold Start”. How do you start?
What I have looked into is Conditional GANs, Pix-2-Pix but I am trying to understand the recommended method on how to start when you have very few images per object class.
I expect that when I drop a few images in a folder and call this library I end up getting a lot more images per class so I can then start my prototyping.
Note that asking for software libraries is specifically off-topic here.
No, there is no magic solution: if your data set doesn't have enough information in its images to train a hand-crafted model, no amount of software will change that fact. However, the first approach is to challenge that "fact": how do you know that you don't have enough images? What happened when you used what you have to train a model? With a small data set you will need more epochs before the model converges, but with a comparable number of iterations you should be able to achieve far better than random accuracy.
I seriously doubt that you'll need to collect and label thousands of images: you have a very restricted paradigm, photos of log trucks taken from a vantage point you control. Training a model to count non-overlapping near-circles will take much less differentiation than, say, distinguishing motor vehicles from postal boxes.
Experiment with the basic models you have at hand -- you already have much more of the solution than you realize. If your data set is too small, go out to the yard with a digital camera and get twice as many, three times, whatever you need. Flip the images left-right to get more input.
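As a minimal sketch of the left-right flip, assuming an image is just a list of pixel rows (with a real image library you would use its own flip routine instead; the function names here are made up):

```python
def flip_left_right(image):
    """Mirror an image given as a list of rows, each row a list of pixel values."""
    return [list(reversed(row)) for row in image]

def augment_with_flips(images):
    """Return the original images followed by their mirrored copies."""
    return images + [flip_left_right(img) for img in images]

tiny_image = [
    [1, 2, 3],
    [4, 5, 6],
]
augmented = augment_with_flips([tiny_image])
print(augmented[1])
```

A mirrored truck still carries the same number of logs, so a count label carries over unchanged. If you label bounding boxes instead, their x-coordinates have to be mirrored as well.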
Does that get you moving?
Transfer learning solves the problem you are describing as "Cold Start". Basically, you can import the weights obtained by training on a big, open dataset and just fine-tune them using the smaller dataset you already have. Data augmentation, freezing some of the layers, etc. may help improve the results of a fine-tuned model.
I was asked in an interview to solve a use case with the help of machine learning. I have to use a machine learning algorithm to identify fraud from transactions. My training dataset has, let's say, 100,200 transactions, out of which 100,000 are legal transactions and 200 are fraud.
I cannot use the dataset as a whole to make the model because it would be a biased dataset and the model would be a very bad one.
Let's say, for example, I take a sample of 200 good transactions which represent the good transactions in the dataset well, plus the 200 fraud ones, and build the model using this as the training data.
The question I was asked was how I would scale up from the 200 good transactions to the whole dataset of 100,000 good records, so that my result can be mapped to all types of transactions. I have never solved this kind of scenario, so I did not know how to approach it.
Any kind of guidance as to how I can go about it would be helpful.
This is a general question thrown out in an interview. Information about the problem is succinct and vague (we don't know, for example, the number of features!). The first thing you need to ask yourself is: what does the interviewer want me to respond? Based on this context, the answer has to be formulated in a similarly general way. This means that we don't have to find 'the solution' but instead give arguments that show we actually know how to approach the problem.
The problem we are presented with is that the minority class (fraud) is only ~0.2% of the total. This is obviously a huge imbalance. A predictor that labelled all cases as 'non fraud' would get a classification accuracy of 99.8% while catching no fraud at all! Therefore, something definitely has to be done.
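The arithmetic behind those figures, using the counts from the question (the variable names are just for illustration):

```python
n_legit = 100_000
n_fraud = 200
total = n_legit + n_fraud

fraud_share = n_fraud / total

# A "classifier" that always answers "not fraud":
# it gets every legal transaction right and every fraud wrong.
baseline_accuracy = n_legit / total
baseline_recall = 0 / n_fraud

print(f"fraud share:       {fraud_share:.2%}")        # 0.20%
print(f"baseline accuracy: {baseline_accuracy:.2%}")  # 99.80%
print(f"baseline recall:   {baseline_recall:.2%}")    # 0.00%
```

This is why accuracy alone says almost nothing here: the useless baseline already scores 99.8%.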
We will define our main task as a binary classification problem where we want to predict whether a transaction is labelled as positive (fraud) or negative (not fraud).
The first step would be to consider what techniques are available to reduce the imbalance. This can be done either by reducing the majority class (undersampling) or by increasing the number of minority samples (oversampling). Both have drawbacks, though. The first implies a severe loss of potentially useful information from the dataset, while the second can lead to overfitting. Some techniques to mitigate this overfitting are SMOTE and ADASYN, which generate new synthetic minority samples with more variety instead of simply duplicating existing ones.
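A rough sketch of the two plain resampling strategies, using only the standard library. Note that this is simple random resampling, not SMOTE or ADASYN (those synthesize new points rather than duplicating existing ones); the function names and the placeholder records are made up for illustration.

```python
import random

def undersample(majority, minority, ratio=1.0, seed=0):
    """Keep all minority samples and a random subset of the majority class."""
    rng = random.Random(seed)
    n_keep = min(len(majority), int(len(minority) * ratio))
    return rng.sample(majority, n_keep) + list(minority)

def oversample(majority, minority, seed=0):
    """Duplicate random minority samples until both classes have the same size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority) + list(minority) + extra

# Placeholder records standing in for real transactions.
legit = [("legit", i) for i in range(100_000)]
fraud = [("fraud", i) for i in range(200)]

small = undersample(legit, fraud)   # 200 legit + 200 fraud
large = oversample(legit, fraud)    # 100,000 legit + 100,000 fraud (with repeats)
print(len(small), len(large))
```

The undersampled set throws away 99,800 legal transactions, and the oversampled set repeats each fraud case about 500 times, which is exactly the information loss and overfitting risk described above.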
Of course, cross-validation in this case becomes paramount. Additionally, in case we are finally doing oversampling, this has to be 'coordinated' with the cross-validation approach to ensure we are making the most of these two ideas. Check http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation for more details.
Apart from these sampling ideas, when selecting our learner, note that many ML methods can be trained/optimised for specific metrics. In our case, we definitely do not want to optimise accuracy. Instead, we want to train the model to optimise either ROC-AUC or, more specifically, a high recall even at a loss of precision, as we want to catch all the positive 'frauds' or at least raise an alarm, even though some will prove to be false alarms. Models can adjust internal parameters (thresholds) to find the optimal balance between precision and recall. Have a look at this nice blog for more info about metrics: https://www.analyticsvidhya.com/blog/2016/02/7-important-model-evaluation-error-metrics/
Finally, it is only a matter of evaluating the model empirically to check which options and parameters are the most suitable given the dataset. Following these ideas does not guarantee 100% that we will be able to tackle the problem at hand. But it ensures we are in a much better position to learn from the data and to get rid of those evil fraudsters out there, while perhaps getting a nice job along the way ;)
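To make the recall versus precision trade-off concrete, here is a small sketch that scores the same predictions at two thresholds. The scores and labels are invented for illustration, and the helper name is hypothetical.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when every score >= threshold is flagged as fraud (label 1)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up model scores (probability of fraud) and true labels.
scores = [0.95, 0.80, 0.60, 0.40, 0.35, 0.10]
labels = [1, 0, 1, 0, 1, 0]

for threshold in (0.5, 0.3):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Lowering the threshold from 0.5 to 0.3 catches the third fraud case (recall goes from 0.67 to 1.00) at the price of one more false alarm (precision drops from 0.67 to 0.60).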
In this problem you want to classify transactions as good or fraud. However, your data is really imbalanced. In that case you will probably be interested in anomaly detection. I will let you read the whole article for more details, but I will quote a few parts in my answer.
I think this will convince you that this is what you are looking for to solve this problem:
Is it not just Classification?
The answer is yes if the following three conditions are met.
You have labeled training data
Anomalous and normal classes are balanced ( say at least 1:5)
Data is not autocorrelated. ( That one data point does not depend on earlier data points. This often breaks in time series data).
If all of above is true, we do not need an anomaly detection techniques and we can use an algorithm like Random Forests or Support Vector Machines (SVM).
However, often it is very hard to find training data, and even when
you can find them, most anomalies are 1:1000 to 1:10^6 events where
classes are not balanced.
Now to answer your question:
Generally, the class imbalance is solved using an ensemble built by
resampling data many times. The idea is to first create new datasets
by taking all anomalous data points and adding a subset of normal data
points (e.g. as 4 times as anomalous data points). Then a classifier
is built for each data set using SVM or Random Forest, and those
classifiers are combined using ensemble learning. This approach has
worked well and produced very good results.
If the data points are autocorrelated with each other, then simple
classifiers would not work well. We handle those use cases using time
series classification techniques or Recurrent Neural networks.
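The resampling ensemble described in the quote can be sketched like this. It is only an illustration under several assumptions: the data is one-dimensional, the "classifier" is a toy threshold standing in for the SVM or Random Forest the article mentions, the votes are combined by simple majority, and all names and numbers are made up.

```python
import random

def build_balanced_subsets(normal, anomalous, n_subsets=5, normal_per_anomaly=4, seed=0):
    """Each subset holds all anomalous points plus a random sample of normal points."""
    rng = random.Random(seed)
    n_normal = min(len(normal), normal_per_anomaly * len(anomalous))
    subsets = []
    for _ in range(n_subsets):
        subset = [(x, 0) for x in rng.sample(normal, n_normal)]
        subset += [(x, 1) for x in anomalous]
        subsets.append(subset)
    return subsets

def train_threshold(subset):
    """Toy stand-in for an SVM or Random Forest: cut halfway between class means."""
    normal_vals = [x for x, y in subset if y == 0]
    anomalous_vals = [x for x, y in subset if y == 1]
    cut = (sum(normal_vals) / len(normal_vals)
           + sum(anomalous_vals) / len(anomalous_vals)) / 2
    return lambda x: 1 if x > cut else 0

def majority_vote(classifiers, x):
    """Label x as anomalous (1) if more than half of the classifiers say so."""
    votes = sum(clf(x) for clf in classifiers)
    return 1 if votes * 2 > len(classifiers) else 0

# Made-up transaction amounts: many small normal ones, a few large anomalies.
normal = [float(i % 100) for i in range(1000)]
anomalous = [500.0, 650.0, 800.0]

subsets = build_balanced_subsets(normal, anomalous)
classifiers = [train_threshold(s) for s in subsets]
print(majority_vote(classifiers, 700.0), majority_vote(classifiers, 20.0))
```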
I would also suggest another approach of the problem. In this article the author said:
If you do not have training data, still it is possible to do anomaly
detection using unsupervised learning and semi-supervised learning.
However, after building the model, you will have no idea how well it
is doing as you have nothing to test it against. Hence, the results of
those methods need to be tested in the field before placing them in
the critical path.
However, you do have a few fraud samples to test whether your unsupervised algorithm is doing well or not, and if it is doing a good enough job, it can be a first solution that will help gather more data to train a supervised classifier later.
Note that I am not an expert and this is just what I've come up with after mixing my knowledge and some articles I read recently on the subject.
For more questions about machine learning, I suggest using this Stack Exchange community
I hope it will help you :)
I am using FCN (Fully Convolutional Networks) and trying to do image segmentation. When training, there are some areas which are mislabeled, and further training doesn't help much to make them go away. I believe this is because the network learns some features which might not be completely correct, but because there are enough correctly classified examples, it is stuck in a local minimum and can't get out.
One solution I can think of is to train for an epoch, then validate the network on training images, and then adjust weights for mismatched parts to penalize mismatch more there in next epoch.
Intuitively, this makes sense to me - but I haven't found any writing on this. Is this a known technique? If yes, what is it called? If no, what am I missing (what are the downsides)?
It highly depends on your network structure. If you are using the original FCN, due to the pooling operations, the segmentation performance on the boundary of your objects is degraded. There have been quite some variants over the original FCN for image segmentation, although they didn't go the route you're proposing.
Just to name a couple of examples here. One approach is to use a Conditional Random Field (CRF) on top of the FCN output to refine the segmentation. You may search for the relevant papers to get more of an idea on that. In some sense, it is close to your idea, but the difference is that the CRF is separated from the network as a post-processing step.
Another very interesting work is U-Net. It uses skip connections, an idea related to those in residual networks (ResNet), so that high-resolution features from lower levels can be integrated into higher levels to achieve more accurate segmentation.
This is still a very active research area. So you may bring the next break-through with your own idea. Who knows! Have fun!
First, if I understand correctly, you want your network to overfit your training set? That's generally something you don't want to happen, because it would mean that while training, your network has found some "rules" that enable it to get great results on your training set but has not been able to generalize, so when you give it new samples it will probably perform poorly. Moreover, you never mention a test set. Have you divided your dataset into training and test sets?
Secondly, to give you something to look into, the idea of penalizing more where you don't perform well made me think of something called "AdaBoost" (it might be unrelated). This short video might help you understand what it is:
https://www.youtube.com/watch?v=sjtSo-YWCjc
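For reference, the sample reweighting step of classic (discrete) AdaBoost looks roughly like the sketch below. This is AdaBoost's own update rule, shown only to illustrate the "penalize the mistakes more in the next round" idea; it is not a recipe for per-pixel weighting in an FCN, and the function name is made up.

```python
import math

def adaboost_reweight(weights, is_wrong):
    """One AdaBoost round: raise the weights of misclassified samples, then renormalize."""
    total = sum(weights)
    error = sum(w for w, wrong in zip(weights, is_wrong) if wrong) / total
    if not 0 < error < 1:
        raise ValueError("weighted error must be strictly between 0 and 1")
    # Say of this round's weak learner; positive when it beats chance (error < 0.5).
    alpha = 0.5 * math.log((1 - error) / error)
    updated = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, is_wrong)
    ]
    norm = sum(updated)
    return [w / norm for w in updated], alpha

# Five samples with equal weight; only the last one was misclassified.
weights = [0.2] * 5
is_wrong = [False, False, False, False, True]
new_weights, alpha = adaboost_reweight(weights, is_wrong)
print([round(w, 3) for w in new_weights])
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.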
Hope it helps