Future-proofing feature scaling in machine learning? - machine-learning

I have a question about how feature scaling works after training a model.
Let's say a neural network model predicts the height of a tree by training on outside temperature.
The lowest outside temperature in my training data is 60F and the max is 100F. I scale the temperature between 0 and 1 and train the model. I save the model for future predictions. Two months later, I want to predict on some new data. But this time the min and max temperatures in my test data are -20F and 50F, respectively.
How does the trained model deal with this? The range I imposed the scaling on in the training set to generate my trained model does not match the test data range.
What would prevent me from hard-coding a range to scale to that I know the data will always fall within, say from -50F to 130F? The problem I see here is when a model has many features. If I impose a different hard range on each feature, feature scaling becomes essentially pointless, does it not?

Different scales won't work. The model learns the mapping for one scale; if you feed it data scaled differently, it will still treat the numbers as if they were on the training scale and produce badly shifted predictions.
Training again will overwrite what was learned before.
So, yes, hard-code your scaling (preferably directly on your data, not inside the model).
And for a quality result, train with all the data you can gather.
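The hard-coded range the asker proposes amounts to a fixed transform applied identically at training and prediction time. A minimal NumPy sketch, using the question's example bounds of -50F and 130F:

```python
import numpy as np

# Fixed physical bounds chosen from domain knowledge (the question's
# example values), applied identically at train and predict time.
T_MIN, T_MAX = -50.0, 130.0

def scale_temp(t):
    """Min-max scale temperatures into [0, 1] using the fixed bounds."""
    return (np.asarray(t, dtype=float) - T_MIN) / (T_MAX - T_MIN)

train_temps = scale_temp([60, 75, 100])  # training-time data
new_temps = scale_temp([-20, 30, 50])    # months-later data, same transform
```

Because the bounds are fixed, a later batch with temperatures of -20F to 50F goes through exactly the same formula the model saw during training.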

Related

Pytorch Feature Scaling within the Model or within the Dataloader

I am trying to perform simple feature scaling in PyTorch. For example, I have an image, and I want to scale certain pixel values down by 10. I see two options:
Directly divide those features by 10.0 in the __getitem__ function of the dataloader;
Pass the original features into the model's forward function, but scale down the corresponding features before passing them through the trainable layers.
I have run several experiments and observed that after the first epoch the validation losses of the two start to diverge slightly, and after a couple hundred epochs the two trained models differ substantially. Any suggestions on this?
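For concreteness, the two options can be sketched like this (a minimal illustration with made-up feature shapes, not the asker's actual code):

```python
import torch
import torch.nn as nn

class ScalingDataset(torch.utils.data.Dataset):
    """Option 1: scale the chosen feature inside __getitem__."""
    def __init__(self, samples):
        self.samples = samples  # list of 1-D feature tensors

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        x = self.samples[idx].clone()
        x[0] = x[0] / 10.0  # scale one feature down by 10
        return x

class ScalingModel(nn.Module):
    """Option 2: scale at the top of forward, before trainable layers."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 1)

    def forward(self, x):
        x = x.clone()
        x[:, 0] = x[:, 0] / 10.0  # same scaling, done inside the model
        return self.fc(x)
```

If the division is applied identically, the two variants are mathematically equivalent, so divergence over training more likely comes from other sources of nondeterminism (seeding, data order, augmentation) than from where the scaling lives.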

Overfitting in convolutional neural network

I was applying a CNN for classification of hand gestures. I have 10 gestures and 100 images for each gesture. The model I constructed was giving around 97% accuracy on the training data, and I got 89% accuracy on the test data. Can I say that my model is overfitted, or is it acceptable to have such an accuracy graph (shown below)?
Add more data to training set
When you have a large amount of data (covering all kinds of instances) in your training set, even an overfitted model can still perform well.
Example: let's say you want to detect just one gesture, say 'thumbs-up' (a binary classification problem), and you have created your positive training set with around 1000 images in which the images are rotated, translated, scaled, recolored, taken from different angles and viewpoints, with cluttered backgrounds, etc. If your training accuracy is 99%, your test accuracy will also be somewhere close.
Because the training set is big enough to cover all instances of the positive class, even if the model is overfitted it will perform well on the test set, as the instances in the test set are only slight variations of the instances in the training set.
In your case, your model is good but if you can add some more data, you will get even better accuracy.
What kind of data to add?
Manually go through the test samples the model got wrong and look for patterns. If you can figure out what kinds of samples fail, add more samples of that kind to your training set and retrain.

Estimating Object size using Deep Neural Network

I have a large dataset of vehicle images with the ground truth of their lengths (over 100k samples). Is it possible to train a deep network to estimate vehicle length?
I haven't seen any papers related to estimating object size using deep neural network.
[Update: I didn't notice computer-vision tag in the question, so my original answer was for a different question]:
Current convolutional neural networks are pretty good at identifying the vehicle model from raw pixels. The technique is called transfer learning: take a general pre-trained model, such as VGGNet or AlexNet, and fine-tune it on a vehicle data set. For example, here's a report of a CS 231n course project that does exactly this (note: done by students, in 2015). No wonder there are already smartphone apps that do it.
So it's more or less a solved problem. Once you know the model type, it's easy to look up its size / length.
But if you're asking a more general question, where the vehicle isn't standard (e.g. it has a trailer, or has been modified somehow), this is much more difficult, even for a human being. A slight change in perspective can result in significant error. Not to mention that some parts of the vehicle may simply not be visible. So the answer to this question is no.
Original answer (assumes the data is a table of general vehicle features, not the picture):
I don't see any difference between vehicle size prediction and, for instance, house price prediction. The process is the same (in the simplest setting): the model learns correlations between features and targets from the training data and then is able to predict the values for unseen data.
If you have good input features and a big enough training set (100k will do), you probably don't even need a deep network for this. In many cases that I've seen, the simplest linear regression produces very reasonable predictions, and it can be trained almost instantly. So, in general, the answer is "yes", but it boils down to what particular data (features) you have.
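To illustrate the tabular setting, here is a linear-regression sketch on synthetic stand-in features, using ordinary least squares via NumPy (the features and weights are made up, not a real vehicle dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for tabular vehicle features (e.g. wheelbase,
# weight, engine size) and lengths; purely illustrative numbers.
X = rng.normal(size=(1000, 3))
true_w = np.array([2.5, 0.8, -1.2])
y = X @ true_w + 4.0 + rng.normal(scale=0.1, size=1000)

# Ordinary least squares: append a bias column and solve.
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
# w[:3] recovers true_w closely; w[3] recovers the 4.0 intercept.
```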
You may do this under some strict conditions.
A brief introduction to Computer Vision / Multi-View Geometry:
Based on the basics of multi-view geometry, the main problem in identifying an object's size is finding the conversion function from the camera view to real-world coordinates. By applying different conditions (e.g. capturing many sequential images, as in video / structure-from-motion, or taking pictures of the same object from different angles), we can estimate this conversion function. Hence, this is completely dependent on camera parameters like focal length, pixel width / height, distortion, etc.
As soon as we have the camera-to-real-world conversion function, it is super easy to calculate the camera-to-point distance, and hence the object's size.
So, based on your current task, you need to supply
image
camera's intrinsic parameters
(optionally) camera's extrinsic parameters
and hopefully get the output you desire.
Alternatively, if you can fix the camera (same model, same intrinsic / extrinsic parameters), you can directly learn the correlation between that camera's images and distances / object sizes, with the image as the only input. However, the NN will most probably not work for different cameras.
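The "super easy" step from camera-to-object distance to object size is the pinhole projection model; a tiny sketch with made-up camera numbers:

```python
def object_size(pixel_extent, distance, focal_length_px):
    """Pinhole model: real extent = pixel extent * distance / focal length.

    pixel_extent     -- object's extent in the image, in pixels
    distance         -- camera-to-object distance (any length unit)
    focal_length_px  -- focal length expressed in pixels (an intrinsic)
    """
    return pixel_extent * distance / focal_length_px

# Made-up numbers: a vehicle spanning 400 px, seen at 10 m with a
# 1000 px focal length, is 4 m long.
length_m = object_size(pixel_extent=400, distance=10.0, focal_length_px=1000.0)
```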

Do you apply min max scaling separately on training and test data?

While applying min max scaling to normalize your features, do you apply min max scaling on the entire dataset before splitting it into training, validation and test data?
Or do you split first and then apply min max on each set, using the min and max values from that specific set?
Lastly, when making a prediction on a new input, should the features of that input be normalized using the min and max values from the training data before being fed into the network?
Split it, then scale. Imagine it this way: you have no idea what real-world data will look like, so you can't scale your training data to match it. Your test data is the surrogate for real-world data, so you should treat it the same way.
To reiterate: Split, scale your training data, then use the scaling from your training data on the testing data.
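Concretely, "split, then scale" means computing the min and max on the training split only and reusing them everywhere else; a minimal NumPy sketch with placeholder values:

```python
import numpy as np

train = np.array([[60.0], [75.0], [100.0]])  # placeholder training split
test = np.array([[-20.0], [50.0]])           # placeholder test split

# Fit the scaling on the training split only...
t_min, t_max = train.min(axis=0), train.max(axis=0)

# ...then apply the SAME min/max to train, test, and any new input.
train_scaled = (train - t_min) / (t_max - t_min)
test_scaled = (test - t_min) / (t_max - t_min)  # may land outside [0, 1]
```

Note that the test values can legitimately fall outside [0, 1]; that is expected, and it is exactly how new inputs should be treated at prediction time.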

How can I train a naivebayes classifier incrementally?

Using Accord.NET I've created a NaiveBayes classifier. It will classify a pixel based on 6 or so sets of image processing results. My images are 5MP, so a training set of 50 images creates a very large set of training data.
6 int array per pixel * 5 million pixels * 50 images.
Instead of trying to store all that data in memory, is there a way to incrementally train the NaiveBayes classifier? Calling Learn() multiple times overwrites the old data each time rather than adding to it.
Right now it is not possible to train a Naive Bayes model incrementally using Accord.NET.
However, since all that Naive Bayes is going to do is to try to fit some distributions to your data, and since your data has very few dimensions, maybe you could try to learn your model on a subsample of your data rather than all of it at once.
When loading images to build your training set, you can randomly discard x% of the pixels in each image. You can also plot the classifier's accuracy for different values of x to find the best balance between memory and accuracy for your model (hint: for such a small model and this large amount of training data, I expect it won't make much of a difference even if you drop 50% of your data).
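The answer targets Accord.NET (C#), but the subsampling idea itself is language-agnostic; a small NumPy sketch with illustrative shapes (10,000 pixels standing in for the 5 MP images):

```python
import numpy as np

rng = np.random.default_rng(0)

def subsample_pixels(features, labels, keep_fraction=0.5):
    """Randomly keep a fraction of the per-pixel training rows.

    features -- (n_pixels, 6) array of image-processing results
    labels   -- (n_pixels,) array of per-pixel class labels
    """
    n = len(features)
    idx = rng.choice(n, size=int(n * keep_fraction), replace=False)
    return features[idx], labels[idx]

# Illustrative shapes: 10,000 pixels stand in for a 5 MP image,
# with 6 image-processing results per pixel.
feats = rng.integers(0, 256, size=(10_000, 6))
labs = rng.integers(0, 2, size=10_000)
f_small, l_small = subsample_pixels(feats, labs, keep_fraction=0.1)
```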
