Training a model using a data set - machine-learning

I have a model that needs to be trained with real-world data that I am acquiring daily. Every 3 or 4 days, I can prepare around 500 images for training, so I have to start training and checking the model as soon as I get the first 500 images. Meanwhile I will acquire another 500 images, and so on. Is it OK to train with the first 500-image data set, save the model weights, and then continue training with the latest 500 images using the saved weights?

This is basically transfer learning: you take a pre-trained model and fine-tune it on your new data. You will have to save the model and its weights, then load them back and train on the new data as you normally would. This is a common practice.
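As a rough illustration, here is a minimal Keras sketch of that save-and-continue loop. The tiny architecture, the random arrays standing in for each batch of ~500 images, and the checkpoint file names are all placeholders for your own pipeline, not details from the question.

```python
import numpy as np
import tensorflow as tf

def build_model():
    # Small stand-in architecture; substitute your real network.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(64, 64, 3)),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

# Dummy arrays stand in for the first batch of ~500 labelled images.
x1 = np.random.rand(500, 64, 64, 3).astype("float32")
y1 = np.random.randint(0, 10, size=500)

model = build_model()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x1, y1, epochs=2, verbose=0)
model.save_weights("week_1.weights.h5")   # persist the learned weights

# Days later: rebuild the model, restore the saved weights, and keep
# training on the next batch of ~500 images.
x2 = np.random.rand(500, 64, 64, 3).astype("float32")
y2 = np.random.randint(0, 10, size=500)

model = build_model()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.load_weights("week_1.weights.h5")
model.fit(x2, y2, epochs=2, verbose=0)
model.save_weights("week_2.weights.h5")
```

The key point is that the second round of `fit()` starts from the restored weights rather than from a fresh initialisation.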

You have two options: effectively engage in transfer learning (as mentioned above), or, if you really believe old data + new data = the best possible data set for you to train on, consider retraining from scratch on the complete data set (old data + new data). The latter gives all data, new and old, an equally fair shake, which is not necessarily true of transfer learning.

Although I have to question your need to do this every 3 or 4 days: if your problem is well formulated and your model design is good, at some point you should have enough data that the model generalizes well enough that continuously feeding it more data no longer improves performance significantly. Also, if the model will perform significantly better having been trained on 2000 images rather than 500, why not just wait a couple more weeks until you have 2000 images before releasing it into the real world? Obviously this depends on your task and area of industry, so you may well have a good reason that I'm not aware of, but it's worth thinking about.

Related

YOLOv5 custom retraining

I trained the YOLOv5s model on my custom data set and got 80% accuracy at inference. Now I need to increase the accuracy by adding more images and labels.
My question is: I already trained on 10,000+ labels to reach 80%, and it took me 7 hours. Do I need to include the old 10,000+ items along with my new data, which is only 1,000 items, to train and improve my accuracy?
Is there any way to retrain the model on the new data only, even when I add a new class?
How can I save time and space?
The topic you are asking about is continual learning, which is an active area of research nowadays. Since you need to add a new class to your model, you need to combine the new class with the previous data and retrain the model from scratch. If you don't do that, i.e., you train only on the new class, your model will completely forget the previously learned features; this forgetting is known as catastrophic forgetting.
Many ways have been suggested to avoid this catastrophic forgetting; I personally feel that Progressive Neural Networks are highly immune to forgetting. Apart from that, you can find other methods here.
As I said, this is currently a highly active area of research; there is no foolproof solution. For now, the best way is to add the new data to the previous data and retrain your model.
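For what it's worth, here is a hedged sketch of what that combined retraining can look like with the YOLOv5 repository's training script, assuming you run it from a checkout of the repo. The directory layout, class names, weight paths, and hyperparameters below are illustrative assumptions, not values from the question.

```python
# Hypothetical sketch: retrain YOLOv5 on old + new data together, starting
# from the previously trained weights instead of from scratch. All paths,
# class names, and hyperparameters are placeholders.
import subprocess
import yaml  # pip install pyyaml

combined = {
    "path": "datasets/combined",   # contains the old 10,000+ items and the new 1,000
    "train": "images/train",
    "val": "images/val",
    "names": ["old_class_0", "old_class_1", "new_class"],  # previous classes plus the added one
}
with open("combined.yaml", "w") as f:
    yaml.safe_dump(combined, f)

# Initialise from the previous run's best weights; layers whose shapes no
# longer match after adding a class (the detection head) are re-initialised.
subprocess.run([
    "python", "train.py",
    "--img", "640",
    "--epochs", "50",
    "--data", "combined.yaml",
    "--weights", "runs/train/exp/weights/best.pt",
], check=True)
```

Starting from the old weights usually shortens training compared to starting from yolov5s.pt, and because the old images are still in the training set, the model does not forget them.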

How to tell if you can apply Machine Learning to a project?

I am working on a personal project in which I log data from my city's bike rental service in a MySQL database. A script runs every thirty minutes and logs, for every bike station, how many free bikes it has. Then, in my database, I average the availability of each station for each day at that given time, which, as of today, gives me an approximate prediction based on 2 months of logged data.
I've read a bit about machine learning and I'd like to learn more. Would it be possible to train a model with my data and make better predictions with ML in the future?
The answer is very likely yes.
The first step is to have some data, and it sounds like you do. You have a response (free bikes) and some features on which it varies (time, location). You have already applied a basic conditional means model by averaging values over factors.
You might augment the data you know about locations with some calendar events like holiday or local event flags.
Prepare a data set with one row per observation, and benchmark the accuracy of your current forecasting process for a period of time on a metric like Mean Absolute Percentage Error (MAPE). Ensure your predictions (averages) for the validation period do not include any of the data within the validation period!
Use the data for this period to validate other models you try.
Split off part of the remaining data into a test set, and use the rest for training. If you have a lot of data, then a common training/test split is 70/30. If the data is small you might go down to 90/10.
Learn one or more machine learning models on the training set, checking performance periodically on the test set to ensure generalization performance is still increasing. Many training algorithm implementations will manage this for you, and stop automatically when test performance starts to decrease due to overfitting. This is a big benefit of machine learning over your current straight average: the ability to learn what generalizes and throw away what does not.
Validate each model by predicting over the validation set, computing the MAPE, and comparing the model's MAPE to that of your original process on the same period. Good luck, and enjoy getting to know machine learning!
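To make the benchmarking step concrete, here is a condensed sketch of that workflow with pandas and scikit-learn (the MAPE helper needs a reasonably recent scikit-learn). The CSV name, column names (station, timestamp, free_bikes), and the gradient-boosting model are assumptions for illustration; it folds the averaging baseline and the candidate model into a single comparison on the held-out validation period.

```python
# Illustrative sketch of the benchmark-then-model workflow described above.
# File name, column names, and the chosen model are assumptions.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error

df = pd.read_csv("bike_logs.csv", parse_dates=["timestamp"])
df["hour"] = df["timestamp"].dt.hour
df["weekday"] = df["timestamp"].dt.weekday

# Hold out the most recent two weeks as the validation period.
cutoff = df["timestamp"].max() - pd.Timedelta(days=14)
history = df[df["timestamp"] <= cutoff]
validation = df[df["timestamp"] > cutoff]

# Baseline: the current conditional-mean approach (average free bikes per
# station/weekday/hour), computed only from data *before* the validation period.
baseline = (history.groupby(["station", "weekday", "hour"])["free_bikes"]
            .mean().rename("pred").reset_index())
val_base = validation.merge(baseline, on=["station", "weekday", "hour"], how="inner")
print("Baseline MAPE:", mean_absolute_percentage_error(val_base["free_bikes"], val_base["pred"]))

# Candidate ML model: train on the history, score on the same validation period.
# Assumes `station` is stored as an integer id; otherwise encode it first.
features = ["station", "weekday", "hour"]
model = GradientBoostingRegressor().fit(history[features], history["free_bikes"])
pred = model.predict(validation[features])
print("Model MAPE:", mean_absolute_percentage_error(validation["free_bikes"], pred))
```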

H2o: Iterating over data bigger than memory without loading all data into memory

Is there a way I can use H2O to iterate over data that is larger than the cumulative memory size of the cluster? I have a big data set that I need to iterate through in batches and feed into TensorFlow for gradient descent. At a given time, I only need to load one batch (or a handful) in memory. Is there a way I can set up H2O to perform this kind of iteration without it loading the entire data set into memory?
Here's a related question that was answered over a year ago, but doesn't solve my problem: Loading data bigger than the memory size in h2o
The short answer is this isn't what H2O was designed to do.
So unfortunately the answer today is no.
The longer answer... (Assuming that the intent of the question is regarding model training in H2O-3.x...)
I can think of at least two ways one might want to use H2O in this way: one-pass streaming, and swapping.
Think of one-pass streaming as having a continuous data stream feeding in, and the data constantly being acted on and then thrown away (or passed along).
Think of swapping as the computer science equivalent of swapping, where there is fast storage (memory) and slow storage (disk) and the algorithms are continuously sweeping over the data and faulting (swapping) data from disk to memory.
Swapping just gets worse and worse from a performance perspective the bigger the data gets. H2O is never tested this way, so you would be on your own. You might be able to figure out how to enable an unsupported swapping mode from clues/hints in the other referenced Stack Overflow question (or the source code), but nobody ever runs that way. H2O was architected to be fast for machine learning by holding data in memory. Machine learning algorithms iteratively sweep over the data again and again; if every data touch hits the disk, it's just not the experience the in-memory H2O-3 platform was designed to provide.
The streaming use case, especially for some algorithms like Deep Learning and DRF, definitely makes more sense for H2O. H2O algorithms support checkpoints, and you can imagine a scenario where you read some data, train a model, then purge that data and read in new data, and continue training from the checkpoint. In the deep learning case, you'd be updating the neural network weights with the new data. In the DRF case, you'd be adding new trees based on the new data.
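As a hedged illustration of that checkpoint-based streaming pattern, here is what it might look like with the H2O-3 Python API and its Deep Learning estimator. The file names, column names, and epoch counts are placeholders, not values from the question.

```python
# Hedged sketch of the checkpoint/streaming pattern described above, using
# the H2O-3 Python API. File names and column names are hypothetical.
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()
predictors, response = ["x1", "x2", "x3"], "label"

# Pass 1: train on the first chunk of data, giving the model a known id.
chunk1 = h2o.import_file("chunk_0001.csv")
model = H2ODeepLearningEstimator(model_id="dl_chunk1", epochs=5)
model.train(x=predictors, y=response, training_frame=chunk1)
h2o.remove(chunk1)  # purge the chunk from cluster memory

# Pass 2: load the next chunk and continue training from the checkpoint.
chunk2 = h2o.import_file("chunk_0002.csv")
model2 = H2ODeepLearningEstimator(checkpoint="dl_chunk1", epochs=10)
model2.train(x=predictors, y=response, training_frame=chunk2)
h2o.remove(chunk2)
```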

Add data to retrained Inception net

I did a few experiments with Google's Inception-v3 net from the tutorial (https://www.tensorflow.org/versions/r0.9/how_tos/image_retraining/index.html).
If I have a large enough data set, then it's fine. But what about when a data set is relatively small and is growing on the go (roughly 10% a day)?
Is there a way to add more data points to the retrained net?
Retraining the whole model each time we get a new data point doesn't seem efficient.
You can think of each day's data as a large batch. TensorFlow uses SGD, which naturally supports this kind of training input.
You can just save your model to disk after you finish each day's training and load yesterday's model before each day's training.
There are checkpoints in TensorFlow if you want to pause and resume. Another option is to train different categories on different layers. It's possible to use your outputs from image retraining as inputs. Better hardware should also be considered.
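One way to wire up that daily save/restore loop is with TensorFlow's checkpoint utilities; the sketch below is only a placeholder, with the small classifier head, optimizer, and checkpoint directory chosen for illustration.

```python
# Rough sketch of the daily save/restore loop using TensorFlow checkpoints.
# The small classifier head, optimizer, and directory name are placeholders.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2048,)),   # e.g. Inception bottleneck features
    tf.keras.layers.Dense(10, activation="softmax"),
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, directory="daily_ckpts", max_to_keep=3)

# Before today's training, restore yesterday's state (a no-op on the first day).
ckpt.restore(manager.latest_checkpoint)

# ... run today's training steps on the newly collected data here ...

# After training, save today's state so tomorrow's run can pick it up.
manager.save()
```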

How do you add new categories and training to a pretrained Inception v3 model in TensorFlow?

I'm trying to utilize a pre-trained model like Inception v3 (trained on the 2012 ImageNet data set) and extend it with several missing categories.
I have TensorFlow built from source with CUDA on Ubuntu 14.04, and the examples like transfer learning on flowers are working great. However, the flowers example strips away the final layer and removes all 1,000 existing categories, which means it can now identify 5 species of flowers, but can no longer identify pandas, for example. https://www.tensorflow.org/versions/r0.8/how_tos/image_retraining/index.html
How can I add the 5 flower categories to the existing 1,000 categories from ImageNet (and add training for those 5 new flower categories) so that I have 1,005 categories that a test image can be classified as? In other words, be able to identify both those pandas and sunflowers?
I understand one option would be to download the entire ImageNet training set and the flowers example set and to train from scratch, but given my current computing power, it would take a very long time, and wouldn't allow me to add, say, 100 more categories down the line.
One idea I had was to set the parameter fine_tune to false when retraining with the 5 flower categories so that the final layer is not stripped: https://github.com/tensorflow/models/blob/master/inception/README.md#how-to-retrain-a-trained-model-on-the-flowers-data , but I'm not sure how to proceed, and not sure if that would even result in a valid model with 1,005 categories. Thanks for your thoughts.
After much learning and working in deep learning professionally for a few years now, here is a more complete answer:
The best way to add categories to an existing model (e.g. Inception trained on the ImageNet LSVRC 1000-class data set) is to perform transfer learning on a pre-trained model.
If you are just trying to adapt the model to your own data set (e.g. 100 different kinds of automobiles), simply perform retraining/fine-tuning by following the myriad online tutorials for transfer learning, including the official one for TensorFlow.
While the resulting model can potentially have good performance, please keep in mind that the tutorial classifier code is highly unoptimized (perhaps intentionally), and you can improve performance several times over by optimizing it for production or simply improving their code.
However, if you're trying to build a general-purpose classifier that includes the default LSVRC data set (1,000 categories of everyday images) and expand it to include your own additional categories, you'll need access to the existing 1,000 LSVRC categories of images and to append your own data set to them. You can download the ImageNet data set online, but access is getting spottier as time rolls on. In many cases, the images are also highly outdated (check out the images for computers or phones for a trip down memory lane).
Once you have that LSVRC data set, perform transfer learning as above, but include the 1,000 default categories along with your own images. For your own images, a minimum of 100 appropriate images per category is generally recommended (the more the better), and you can get better results if you enable distortions (but this will dramatically increase retraining time, especially if you don't have a GPU, because the bottleneck files cannot be reused for each distortion; personally I think this is pretty lame and there's no reason why distortions couldn't also be cached as bottleneck files, but that's a different discussion, and it can be added to your code manually).
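For a rough idea of what that combined retraining looks like in code, here is a hedged Keras sketch that rebuilds a 1,005-way classifier head over a pre-trained InceptionV3 and fine-tunes it on a merged directory of LSVRC images plus the new categories. The directory layout, epoch count, and batch size are assumptions for illustration.

```python
# Hedged sketch: new classifier head over InceptionV3 covering the 1,000
# ImageNet categories plus 5 new ones (1,005 total), trained on a merged
# data set. The directory layout below is a hypothetical assumption.
import tensorflow as tf

NUM_CLASSES = 1005  # 1,000 LSVRC categories + 5 flower categories

base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet", pooling="avg")
base.trainable = False  # start by training only the new head (transfer learning)

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# One folder per category, containing both the LSVRC images and your own:
#   combined_dataset/<category_name>/<image files>
train_ds = tf.keras.utils.image_dataset_from_directory(
    "combined_dataset", image_size=(299, 299), batch_size=32)
train_ds = train_ds.map(lambda x, y: (tf.keras.applications.inception_v3.preprocess_input(x), y))

model.fit(train_ds, epochs=5)
```

A common follow-up is to unfreeze some of the top Inception blocks once the new head converges and continue training at a lower learning rate.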
Using these methods and incorporating error analysis, we've trained general purpose classifiers on 4000+ categories to state-of-the-art accuracy and deployed them on tens of millions of images. We've since moved on to proprietary model design to overcome existing model limitations, but transfer learning is a highly legitimate way to get good results and has even made its way to natural language processing via BERT and other designs.
Hopefully, this helps.
Unfortunately, you cannot add categories to an existing graph; you'll basically have to save a checkpoint and train that graph from that checkpoint onward.
