Suggestions needed about the generalization of a regression neural network - machine-learning

I've trained a deep neural network with a few hundred features that analyzes geo data of a city and calculates a score per sample based on the profile between the observer and the target location. That is, the longer the distance between the observer and the target, the more features I will have for this sample. When I train my NN with samples from part of a city and test with other parts of the same city, the NN works very well, but when I apply my NN to other cities, it starts to give a high standard deviation of errors, especially in cases where the samples of the city I'm applying the NN to generally have more features than the samples of the city I used to train it. To deal with that, I've appended 10% of empty samples in training which was able to reduce the errors by half, but the remaining errors are still too large compared to the solutions calculated by hand. May I have some advice on how to generalize a regression neural network? Thanks!

I was going to ask for more examples of your data, and your network, but it wouldn't really matter.
How to improve the generalization of a regression neural network?
You can use exactly the same things you would use for a classification neural network. The only difference is what it does with the numbers that are output from the penultimate layer!
I've appended 10% of empty samples in training which was able to reduce the errors by half,
I didn't quite understand what that meant (so I'd still be interested if you expanded your question with some more concrete details), but it sounds a bit like using dropout. In Keras you add a Dropout() layer between your other layers:
...
model.add(Dense(...))
model.add(Dropout(0.2))
model.add(Dense(...))
...
0.2 means 20% dropout, which is a nice starting point: you could experiment with values up to about 0.5.
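To make the rate concrete, here is a minimal pure-Python sketch of what (inverted) dropout does to one layer's activations during training. The function name `dropout` and the sample values are illustrative assumptions, not Keras internals: each value is zeroed with probability `rate`, and the survivors are scaled by 1/(1 - rate) so the expected sum is unchanged.

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; scale the rest by 1/(1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]

rng = random.Random(0)               # fixed seed so the run is repeatable
hidden = [0.5, 1.0, 1.5, 2.0, 2.5]   # made-up activations of one hidden layer
print(dropout(hidden, rate=0.2, rng=rng))
```

At prediction time no units are dropped, which is why the scaling is applied during training only.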
You could read the original paper, or this article, which seems to be a good introduction with Keras examples.
The other generic technique is to add some L1 and/or L2 regularization; here is the manual entry.
I typically use a grid search to experiment with each of these, e.g. trying each of 0, 1e-6, 1e-5 for each of L1 and L2, and each of 0, 0.2, 0.4 (usually using the same value between all layers, for simplicity) for dropout. (If 1e-5 is best, I might also experiment with 5e-4 and 1e-4.)
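The grid described above can be enumerated with `itertools.product`. This is only a sketch: `validation_error` is a hypothetical stand-in with a toy formula so that the code runs; in practice it would train a model with the given settings and return its validation error.

```python
from itertools import product

l1_values = [0, 1e-6, 1e-5]
l2_values = [0, 1e-6, 1e-5]
dropout_values = [0, 0.2, 0.4]   # same rate used between all layers

def validation_error(l1, l2, dropout):
    """Hypothetical stand-in: train with these settings, return validation error."""
    # Toy formula so the sketch runs; replace with real training + evaluation.
    return abs(l1 - 1e-6) * 1e5 + abs(l2 - 1e-5) * 1e5 + abs(dropout - 0.2)

grid = list(product(l1_values, l2_values, dropout_values))
best = min(grid, key=lambda cfg: validation_error(*cfg))
print(len(grid), "combinations; best (l1, l2, dropout):", best)
```

With three values for each of three settings the grid has 27 combinations, so the cost of a full search grows quickly as more values are added.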
But remember that even better than the above is more training data. Also consider using domain knowledge to add more data, or more features.

Related

How to deal with dataset of different features?

I am working to create an MLP model on a CEA Classification Dataset (Binary Classification). Each sample contains 4 different features, such as resistance and other values, each in its own range (resistance in hundreds, another in micros, etc.). I am still new to machine learning and this is the first real model I am building. How can I deal with such data? I have tried feeding each sample to the neural network with a sigmoid activation function, but I am not getting accurate results. My assumption is that this kind of data should be scaled. If so, what are some useful resources to look at, since I do not quite understand when scaling is required?
Scaling your data can be an important step in building a machine-learning model, especially when working with neural networks. Scaling can help to ensure that all of the features in your dataset are on a similar scale, which can make it easier for the model to learn.
There are a few different ways to scale your data, such as normalization and standardization. Normalization is the process of scaling the data so that it has a minimum value of 0 and a maximum value of 1. Standardization is the process of scaling the data so that it has a mean of 0 and a standard deviation of 1.
When working with your CEA Classification dataset, it might be helpful to try both normalization and standardization to see which one works better for your specific dataset. You can use scikit-learn library's preprocessing functions like MinMaxScaler() and StandardScaler() for normalization and standardization respectively.
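As a rough illustration of the two formulas, here is a pure-Python sketch for a single feature column. The function names and the resistance values are made up for the example; scikit-learn's MinMaxScaler and StandardScaler apply the same per-feature arithmetic to whole arrays.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Normalization: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]   # assumes hi > lo

def standardize(values):
    """Standardization: subtract the mean, divide by the (population) std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]       # assumes sigma > 0

resistance = [100.0, 200.0, 300.0, 400.0]   # made-up feature in the hundreds
print(min_max_scale(resistance))
print(standardize(resistance))
```

Whichever scaler is used, its statistics (min/max or mean/std) should be computed on the training data and then reused unchanged for validation and test data.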
Additionally, it might be helpful to try different activation functions, such as ReLU or LeakyReLU, to see if they lead to more accurate results. Also, you can try adding more layers and neurons in your neural network to see if it improves the performance.
It's also important to remember that feature engineering, which includes the process of selecting the most important features, can be more important than scaling.

Why do different batch sizes give different accuracy in Keras?

I was using Keras' CNN to classify MNIST dataset. I found that using different batch-sizes gave different accuracies. Why is it so?
Using Batch-size 1000 (Acc = 0.97600)
Using Batch-size 10 (Acc = 0.97599)
Although, the difference is very small, why is there even a difference?
EDIT - I have found that the difference is only because of precision issues and they are in fact equal.
That is because of the effect of mini-batch gradient descent during the training process. You can find a good explanation here; some notes from that link:
Batch size is a slider on the learning process.
Small values give a learning process that converges quickly at the cost of noise in the training process.
Large values give a learning process that converges slowly with accurate estimates of the error gradient.
Another important note from that link:
The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a given computational cost, across a wide range of experiments. In all cases the best results have been obtained with batch sizes m = 32 or smaller
This is the result of this paper.
EDIT
I should mention two more points here:
Because of the inherent randomness in machine learning algorithms, you should generally not expect them (deep learning algorithms included) to produce the same results on different runs. You can find more details here.
On the other hand, your two results are so close that they are effectively equal. So in your case we can say that, based on the reported results, the batch size has no effect on your network's results.
This is not connected to Keras. The batch size, together with the learning rate, is a critical hyper-parameter for training neural networks with mini-batch stochastic gradient descent (SGD); both strongly affect the learning dynamics and thus the accuracy, the learning speed, etc.
In a nutshell, SGD optimizes the weights of a neural network by iteratively updating them towards the (negative) direction of the gradient of the loss. In mini-batch SGD, the gradient is estimated at each iteration on a subset of the training data. It is a noisy estimation, which helps regularize the model and therefore the size of the batch matters a lot. Besides, the learning rate determines how much the weights are updated at each iteration. Finally, although this may not be obvious, the learning rate and the batch size are related to each other. [paper]
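The update rule described above can be sketched in pure Python on a toy problem. Everything here is an illustrative assumption (a one-weight model y = w * x, noise-free data with true weight 3, plain SGD without momentum), not the setup from the question.

```python
import random

def minibatch_sgd(xs, ys, batch_size, learning_rate, epochs, seed=0):
    """Fit y ~ w * x by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new batch composition every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w*x - y)^2) w.r.t. w, estimated on this batch only.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad  # step against the gradient
    return w

data_rng = random.Random(42)
xs = [data_rng.random() for _ in range(100)]
ys = [3.0 * x for x in xs]  # toy target, true weight is 3

for batch_size in (1, 10, 100):
    w = minibatch_sgd(xs, ys, batch_size, learning_rate=0.1, epochs=50)
    print(batch_size, round(w, 4))
```

With the number of epochs and the learning rate held fixed, a larger batch means fewer weight updates per epoch, so the three runs need not end at the same weight. That is one simple way in which batch size and learning rate interact.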
I want to add two points:
1) With special treatments, it is possible to achieve similar performance with a very large batch size while speeding up the training process tremendously. For example,
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
2) Regarding your MNIST example, I really don't suggest reading too much into these numbers, because the difference is so subtle that it could be caused by noise. I bet if you try models saved at a different epoch, you will see a different result.

Why do neural networks work so well?

I understand all the computational steps of training a neural network with gradient descent using forwardprop and backprop, but I'm trying to wrap my head around why they work so much better than logistic regression.
For now all I can think of is:
A) the neural network can learn its own parameters
B) there are many more weights than simple logistic regression thus allowing for more complex hypotheses
Can someone explain why a neural network works so well in general? I am a relative beginner.
Neural networks can have a large number of free parameters (the weights and biases between interconnected units) and this gives them the flexibility to fit highly complex data (when trained correctly) that other models are too simple to fit. This model complexity brings with it the problems of training such a complex network and ensuring the resultant model generalises beyond the examples it’s trained on (typically neural networks require large volumes of training data that other models don't).
Classically logistic regression has been limited to binary classification using a linear classifier (although multi-class classification can easily be achieved with one-vs-all, one-vs-one approaches etc. and there are kernelised variants of logistic regression that allow for non-linear classification tasks). In general therefore, logistic regression is typically applied to simpler, linearly-separable classification tasks, where small amounts of training data are available.
Models such as logistic regression and linear regression can be thought of as the simplest case of a multi-layer perceptron, one with no hidden layer (check out this site for one explanation of how).
To conclude, it’s the model complexity that allows neural nets to solve more complex classification tasks, and to have a broader application (particularly when applied to raw data such as image pixel intensities etc.), but their complexity means that large volumes of training data are required and training them can be a difficult task.
Recently Dr. Naftali Tishby's idea of Information Bottleneck to explain the effectiveness of deep neural networks is making the rounds in the academic circles.
His video explaining the idea (link below) can be rather dense so I'll try to give the distilled/general form of the core idea to help build intuition
https://www.youtube.com/watch?v=XL07WEc2TRI
To ground your thinking, visualize the MNIST task of classifying the digit in the image. For this, I am only talking about simple fully-connected neural networks (not convolutional NNs, as are typically used for MNIST).
The input to a NN contains information about the output hidden inside of it. Some function is needed to transform the input to the output form. Pretty obvious.
The key difference in thinking needed to build better intuition is to think of the input as a signal with "information" in it (I won't go into information theory here). Some of this information is relevant for the task at hand (predicting the output). Think of the output as also a signal with a certain amount of "information". The neural network tries to "successively refine" and compress the input signal's information to match the desired output signal. Think of each layer as cutting away at the unnecessary parts of the input information, and keeping and/or transforming the information about the output along the way through the network.
The fully-connected neural network will transform the input information into a form in the final hidden layer, such that it is linearly separable by the output layer.
This is a very high-level and fundamental interpretation of the NN, and I hope it will help you see it clearer. If there are parts you'd like me to clarify, let me know.
There are other essential pieces in Dr. Tishby's work, such as how minibatch noise helps training, and how the weights of a neural network layer can be seen as doing a random walk within the constraints of the problem.
These parts are a little more detailed, and I'd recommend first toying with neural networks and taking a course on Information Theory to help build your understanding.
Suppose you have a large dataset and you want to build a binary classification model for it. You have the two options that you pointed out:
Logistic Regression
Neural Networks ( Consider FFN for now )
Each node in a neural network is associated with an activation function. For example, let's choose sigmoid, since logistic regression also uses a sigmoid internally to make its decision.
Let's see how the decision boundary of logistic regression looks when applied to the data
See some of the green spots present in the red boundary?
Now let's see the decision boundary of a neural network (forgive me for using a different color)
Why does this happen? Why is the decision boundary of the neural network so flexible, giving more accurate results than logistic regression?
Or, as you asked, "Why do neural networks work so well?" It is because of their hidden units or hidden layers and their representation power.
Let me put it this way.
You have a logistic regression model and a neural network which has, say, 100 neurons, each with a sigmoid activation. Each neuron is then equivalent to one logistic regression.
Now compare a hundred logistic units trained together to solve one problem with a single logistic regression model. Because of these hidden layers the decision boundary becomes more flexible and yields better results.
While you are experimenting you can add more neurons and see how the decision boundary changes. A logistic regression is the same as a neural network with a single neuron.
The above is just an example. Neural networks can be trained to get very complex decision boundaries.
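A small sketch of that idea: XOR is not linearly separable, so a single logistic unit cannot represent it, but three logistic units wired as a tiny network can. The weights below are hand-picked for illustration (an assumption, not the result of training).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_unit(weights, bias, inputs):
    """One sigmoid neuron, i.e. the logistic regression prediction function."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def tiny_network(x1, x2):
    """Two hidden sigmoid units plus one output unit, with hand-picked weights."""
    h_or = logistic_unit([20, 20], -10, [x1, x2])    # fires if x1 OR x2
    h_and = logistic_unit([20, 20], -30, [x1, x2])   # fires only if x1 AND x2
    return logistic_unit([20, -20], -10, [h_or, h_and])  # OR but not AND = XOR

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), round(tiny_network(x1, x2)))
```

Each hidden unit draws one straight boundary; the output unit combines them into a region that no single straight line could carve out.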
Neural networks allow the person training them to algorithmically discover features, as you pointed out. However, they also allow for very general nonlinearity. If you wish, you can use polynomial terms in logistic regression to achieve some degree of nonlinearity, however, you must decide which terms you will use. That is you must decide a priori which model will work. Neural networks can discover the nonlinear model that is needed.
'Work so well' depends on the concrete scenario. Both of them do essentially the same thing: predicting.
The main difference here is that a neural network can have hidden nodes for concepts, if it's properly set up (not easy), and use them to make the final decision.
Linear regression, on the other hand, is based on more obvious facts, and not side effects. A neural network should be able to make more accurate predictions than linear regression.
Neural networks excel at a variety of tasks, but to get an understanding of exactly why, it may be easier to take a particular task like classification and dive deeper.
In simple terms, machine learning techniques learn a function to predict which class a particular input belongs to, depending on past examples. What sets neural nets apart is their ability to construct these functions that can explain even complex patterns in the data. The heart of a neural network is an activation function like ReLU, which allows it to draw some basic classification boundaries like:
[Figure: Example classification boundaries of ReLUs]
By composing hundreds of such ReLUs together, neural networks can create arbitrarily complex classification boundaries, for example:
[Figure: Composing classification boundaries]
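As a one-dimensional illustration of composing ReLUs, three shifted ReLUs are enough to build a localized "bump". The function name `tent` and the chosen coefficients are assumptions made for this sketch.

```python
def relu(x):
    return max(0.0, x)

def tent(x):
    """Piecewise-linear bump built from three shifted ReLUs."""
    return relu(x) - 2 * relu(x - 1) + relu(x - 2)

for x in [-1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0]:
    print(x, tent(x))
```

The result is zero outside the interval from 0 to 2 and peaks at x = 1. Adding many such bumps at different positions and heights is one way to see how sums of ReLUs can approximate complicated shapes.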
The following article tries to explain the intuition behind how neural networks work: https://medium.com/machine-intelligence-report/how-do-neural-networks-work-57d1ab5337ce
Before you step into neural networks, see if you have assessed all aspects of normal regression.
Use this as a guide
And even before you discard normal regression, for curved dependencies you should strongly consider SVMs with kernels.
Neural networks are defined with an objective and a loss function. The only process that happens within a neural net is optimizing the objective by reducing the loss, or error. Backpropagation helps in finding the weights that optimize the objective function.

Why do I get good accuracy with IRIS dataset with a single hidden node?

I have a minimal example of a neural network with a back-propagation trainer, testing it on the IRIS data set. I started off with 7 hidden nodes and it worked well.
I lowered the number of nodes in the hidden layer to 1 (expecting it to fail), but was surprised to see that the accuracy went up.
I set up the experiment in Azure ML, just to validate that it wasn't my code. Same thing there, 98.3333% accuracy with a single hidden node.
Can anyone explain to me what is happening here?
First, it has been well established that a variety of classification models yield incredibly good results on Iris (Iris is very predictable); see here, for example.
Secondly, we can observe that there are relatively few features in the Iris dataset. Moreover, if you look at the dataset description you can see that two of the features are very highly correlated with the class outcomes.
These correlation values are linear, single-feature correlations, which indicates that one can most likely apply a linear model and observe good results. Neural nets are highly nonlinear; they become more and more complex and capture greater and greater nonlinear feature combinations as the number of hidden nodes and hidden layers is increased.
Taking these facts into account, that (a) there are few features to begin with and (b) that there are high linear correlations with class, would all point to a less complex, linear function as being the appropriate predictive model-- by using a single hidden node, you are very nearly using a linear model.
It can also be noted that, in the absence of any hidden layer (i.e., just input and output nodes), and when the logistic transfer function is used, this is equivalent to logistic regression.
Just adding to DMlash's very good answer: The Iris data set can even be predicted with a very high accuracy (96%) by using just three simple rules on only one attribute:
If Petal.Width = (0.0976,0.791] then Species = setosa
If Petal.Width = (0.791,1.63] then Species = versicolor
If Petal.Width = (1.63,2.5] then Species = virginica
In general neural networks are black boxes where you never really know what they are learning, but in this case reverse-engineering should be easy. It is conceivable that it learnt something like the above.
The above rules were found by using the OneR package.
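Written as code, the three rules amount to two threshold comparisons. This is a sketch: the function name is hypothetical, and widths outside the quoted range of 0.0976 to 2.5 are simply assigned to the nearest rule, which is an assumption beyond what the rules state.

```python
def predict_species(petal_width):
    """Apply the three single-attribute rules quoted above (petal width in cm)."""
    if petal_width <= 0.791:
        return "setosa"
    if petal_width <= 1.63:
        return "versicolor"
    return "virginica"

for width in (0.2, 1.3, 2.1):   # illustrative petal widths
    print(width, predict_species(width))
```

A model this small shows why a single hidden node can be enough on Iris: one well-placed variable already separates most of the classes.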

Is there any classifier which is able to make decisions very fast?

Most classification algorithms are developed to improve the training speed. However, is there any classifier or algorithm focusing on the decision-making speed (low computational complexity and a simple, realizable structure)? I can get enough training data and tolerate a long training time.
There are many methods which classify fast. You could more or less sort models by classification speed in the following way (fastest first, slowest last):
Decision Tree (especially with limited depth)
Linear models (linear regression, logistic regression, linear svm, lda, ...) and Naive Bayes
Non-linear models based on explicit data transformation (Nystroem kernel approximation, RVFL, RBFNN, EEM), Kernel methods (such as kernel SVM) and shallow neural networks
Random Forest and other committees
Big neural networks (e.g. CNN)
KNN with arbitrary distance
Obviously this list is not exhaustive, it just shows some general ideas.
One way of obtaining such a model is to build a complex, slow model, then use it as a black-box label generator to train a simpler model (but on a potentially infinite training set), thus getting a fast classifier at the cost of very expensive training. There are many works showing that one can do that, for example by training a shallow neural network on the outputs of a deep NN.
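A toy sketch of that teacher/student idea, under loudly stated assumptions: the "slow" teacher is a 1-nearest-neighbour classifier on made-up one-dimensional data, and the "fast" student is a single threshold. Real uses involve far larger models, but the workflow is the same: generate inputs, let the teacher label them, fit the student on those labels.

```python
def teacher_predict(train, x):
    """Slow 'black box': 1-nearest-neighbour over all labelled points."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def fit_student(xs, labels):
    """Fast student: pick the threshold t that best matches 'label = 1 if x >= t'."""
    def disagreements(t):
        return sum((1 if x >= t else 0) != y for x, y in zip(xs, labels))
    return min(xs, key=disagreements)

# Small labelled set (made-up 1-D data) that the teacher memorises.
train = [(0.10, 0), (0.20, 0), (0.35, 0), (0.60, 1), (0.80, 1), (0.90, 1)]

# Unlabelled inputs are cheap to generate, so the teacher can label as many as we like.
unlabelled = [i / 200 for i in range(201)]
pseudo_labels = [teacher_predict(train, x) for x in unlabelled]

threshold = fit_student(unlabelled, pseudo_labels)

def student_predict(x):
    return 1 if x >= threshold else 0   # one comparison per prediction

print("learned threshold:", threshold)
```

Training the student is the expensive part here (it scans every candidate threshold), while prediction afterwards is a single comparison and no longer touches the training set.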
In general classification speed should not be a problem. Some exceptions are algorithms which have a time complexity depending on the number of samples you have for training. One example is k-Nearest-Neighbors which has no training time, but for classification it needs to check all points (if implemented in a naive way). Other examples are all classifiers which work with kernels since they compute the kernel between the current sample and all training samples.
Many classifiers work with a scalar product of the features and a learned coefficient vector. These should be fast enough in almost all cases. Examples are logistic regression, linear SVMs, perceptrons and many more. See @lejlot's answer for a nice list.
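The difference in prediction cost can be seen by putting the two kinds of classifier side by side. This is a sketch with made-up weights and data: the linear model does one pass over the features, while naive k-NN computes a distance to every stored training sample.

```python
def linear_predict(weights, bias, x):
    """One dot product: cost grows with the number of features only."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else 0

def knn_predict(train_x, train_y, x, k=3):
    """Naive k-NN: computes a distance to every stored training sample."""
    by_distance = sorted(
        (sum((a - b) ** 2 for a, b in zip(point, x)), label)
        for point, label in zip(train_x, train_y)
    )
    nearest = [label for _, label in by_distance[:k]]
    return max(set(nearest), key=nearest.count)   # majority vote

# Made-up 2-D data: class 0 near the origin, class 1 further out.
train_x = [(0, 0), (0, 1), (1, 0), (3, 3), (3, 4), (4, 3)]
train_y = [0, 0, 0, 1, 1, 1]

query = (3.5, 3.5)
print(linear_predict([1.0, 1.0], -4.0, query), knn_predict(train_x, train_y, query))
```

Both give the same answer on this toy data, but only the k-NN cost grows with the size of the training set.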
If these are still too slow you might try to reduce the dimension of your feature space first and then try again (this also speeds up training time).
Btw, this question might not be suited for StackOverflow as it is quite broad and recommendation instead of problem oriented. Maybe try https://stats.stackexchange.com/ next time.
I have a decision tree which is represented in the compressed form and which is at least 4 times faster than the actual tree in classifying an unseen instance.

Resources