How would gradient descent work in a multilevel regression setting? This is fairly clear to me in a standard linear regression formulation, but I haven't been able to wrap my head around parameter updates in hierarchical models.
In short, I'd like to be able to make sequential updates to an existing model in an online-ish format (we'll see new data every t timesteps, and would actually like to bias the updates towards newer observations, but that's potentially another conversation), where the idea would be to have better-informed parameter estimates over time. My thought was that hand-coding a gradient descent update step would be a potential solution, but I need to figure out how this would work given the hierarchical structure of the data.
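For what it's worth, here is a minimal sketch of what one sequential (stochastic) update could look like for a random-intercept model y = mu + alpha[group] + beta * x + noise, where a Gaussian prior on the group offsets supplies the hierarchical shrinkage. The model form, names, and penalty term are my assumptions for illustration, not a reference implementation:

```python
import numpy as np

def sgd_step(mu, beta, alpha, x, y, group, lr=0.01, tau2=1.0):
    """One stochastic gradient update for a random-intercept model.

    Assumed model:  y ~ Normal(mu + alpha[group] + beta * x, sigma^2)
                    alpha[j] ~ Normal(0, tau2)   # partial pooling of group offsets
    The prior on alpha shows up as an extra shrinkage term in its gradient.
    """
    err = mu + alpha[group] + beta * x - y              # d(0.5*err^2)/d(prediction)
    mu   -= lr * err                                    # global intercept
    beta -= lr * err * x                                # global slope
    alpha[group] -= lr * (err + alpha[group] / tau2)    # data term + prior shrinkage
    return mu, beta, alpha

# toy usage: stream observations one at a time (newer data simply arrives later)
rng = np.random.default_rng(0)
mu, beta, alpha = 0.0, 0.0, np.zeros(5)                 # 5 groups
for _ in range(10_000):
    j = rng.integers(5)
    x = rng.normal()
    y = 1.0 + 2.0 * x + 0.3 * j + rng.normal(0, 0.1)    # true group offset 0.3*j
    mu, beta, alpha = sgd_step(mu, beta, alpha, x, y, j)
```

Weighting recent observations more heavily could then be as simple as scaling lr per example, but that is the separate conversation you mention.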
Related
I plotted the data I wanted to model, and the result is shown in the picture. I tried modeling it using a sinc function but failed,
so if anyone has an idea, that would help. https://i.stack.imgur.com/QY17L.jpg
First, please note that while using linear regression for this problem may let you fit these data points, it isn't going to tell you much about any future data; it will only fit your test data if that data lies on this same curve. If you're looking for something to predict future prices, you might want to consider a time series model.
However, if you're just trying to use a linear regression to fit this data to that curve, you have to get slightly creative. If all your features are linear and you're using linear regression, then one way or another you'll end up with a linear answer, which won't fit this shape. So you will need to make your own custom features from your data. You can probably get a pretty good approximation using a 10th-degree polynomial, so your features could be X (years since 1992), X^2 (years since 1992, squared), X^3, X^4, X^5, X^6, ..., X^10.
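As a rough sketch of that feature-engineering idea (the library, degree, and toy data below are my own assumptions), you build the polynomial columns and still fit an ordinary linear regression on them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# x = years since 1992, y = observed values (toy placeholders, not the real data)
x = np.arange(0, 30, dtype=float).reshape(-1, 1)
y = np.sin(x).ravel() / (x.ravel() + 1.0) + 0.05 * np.random.randn(len(x))

# expand the single feature into [X, X^2, ..., X^10] and fit a plain linear model
X_poly = PolynomialFeatures(degree=10, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_poly, y)
y_hat = model.predict(X_poly)
```

In practice, centering and scaling x before raising it to the 10th power keeps the design matrix much better conditioned.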
A variety of other classifiers would also work fine, but you'll probably need to use some sort of time series model (LSTM for example) to get anything you can generalize to predict the future.
This is the problem that I should describe. Unfortunately, the only technique I have studied for estimating the parameters of a linear regression is the classic gradient descent algorithm. Is that "batch" or "sequential" mode? And what is the difference between them?
I wasn't expecting to find exactly the question from the ML exam here! Well, the point is that, as James Phillips says, gradient descent is an iterative method, i.e. so-called sequential. Gradient descent is just an iterative optimization algorithm for finding the minimum of a function, but you can use it to find the best-fitting line. A completely batch approach would be, e.g., the linear least squares method, which uses all the equations at once: you find the parameters by taking the partial derivatives of the sum of squared errors with respect to the line's coefficients and setting them to zero. Of course, as Phillips said, it is not always a convenient method; it's more of a theoretical definition. Hope this is useful.
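To make the batch/sequential contrast concrete, here is a small sketch (plain NumPy; the toy data is mine) that solves the same line fit once in closed form and once by iterating over examples:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, 200)
X = np.column_stack([np.ones_like(x), x])    # design matrix [1, x]

# batch: set the derivatives of the summed squared error to zero (normal equations)
w_batch = np.linalg.solve(X.T @ X, X.T @ y)

# sequential: stochastic gradient descent, one example per update
w, lr = np.zeros(2), 0.05
for epoch in range(50):
    for xi, yi in zip(X, y):
        w -= lr * (xi @ w - yi) * xi         # gradient of 0.5 * (xi.w - yi)^2

print(w_batch, w)                            # both end up close to [1.0, 3.0]
```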
From Liang et al. "A Fast and Accurate Online Sequential Learning Algorithm for Feedforward Networks":
Batch learning is usually a time consuming affair as it may involve many iterations through the training data. In most applications, this may take several minutes to several hours and further the learning parameters (i.e., learning rate, number of learning epochs, stopping criteria, and other predefined parameters) must be properly chosen to ensure convergence. Also, whenever a new data is received batch learning uses the past data together with the new data and performs a retraining, thus consuming a lot of time. There are many industrial applications where online sequential learning algorithms are preferred over batch learning algorithms as sequential learning algorithms do not require retraining whenever a new data is received. The back-propagation (BP) algorithm and its variants have been the backbone for training SLFNs with additive hidden nodes. It is to be noted that BP is basically a batch learning algorithm. Stochastic gradient descent BP (SGBP) is one of the main variants of BP for sequential learning applications.
Basically, gradient descent is defined in a batch way (the gradient is computed over the whole training set), but in practice you often use sequential, iterative variants.
I think the question doesn't ask you to show two ways (batch and sequential) to estimate the parameters of the model, but instead to explain—either in a batch or sequential mode—how such an estimation would work.
For instance, if you are trying to estimate parameters for a linear regression model, you could just describe likelihood maximization, which is equivalent to minimizing the least-squares error:
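For example, assuming Gaussian noise $y_i = w^\top x_i + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ (notation mine), maximizing the log-likelihood is the same as minimizing the sum of squared errors:

$$\log L(w) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - w^{\top}x_i\bigr)^2 \quad\Longrightarrow\quad \hat{w} = \arg\min_{w}\sum_{i=1}^{n}\bigl(y_i - w^{\top}x_i\bigr)^2$$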
If you want to show a sequential mode, you can describe the gradient descent algorithm.
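In the sequential mode, each example $(x_t, y_t)$ arrives on its own and triggers a single small correction of the weights (the LMS / stochastic gradient step, with learning rate $\eta$):

$$w_{t+1} = w_t - \eta\,\bigl(w_t^{\top}x_t - y_t\bigr)\,x_t$$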
In courses there is nothing about epochs, but in practice they are used everywhere.
Why do we need them if the optimizer finds the best weights in one pass? Why does the model improve?
Generally, whenever you want to optimize, you use gradient descent. Gradient descent has a parameter called the learning rate. In one iteration alone you cannot guarantee that the gradient descent algorithm will converge to a local minimum with the specified learning rate. That is why you iterate again, so that gradient descent converges better.
It's also good practice to change the learning rate per epoch, by observing the learning curves, for better convergence.
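A minimal sketch of that practice (the decay schedule and numbers below are placeholders, not recommendations): keep iterating over the data for several epochs and shrink the learning rate as you go.

```python
import numpy as np

def train(X, y, epochs=20, lr=0.1, decay=0.9):
    """Full-batch gradient descent on squared error, with per-epoch learning-rate decay."""
    w = np.zeros(X.shape[1])
    for epoch in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error / 2
        w -= lr * grad                      # one step is rarely enough...
        lr *= decay                         # ...so shrink the step and go again next epoch
    return w
```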
Why do we need [to train several epochs] if the optimizer finds the best weight in one pass?
That's wrong in most cases. Gradient descent methods (see a list of them) do not usually find the optimal parameters (weights) in one pass. In fact, I have never seen a case where the optimal parameters were actually reached (except in constructed examples).
One epoch consists of many weight update steps. One epoch means that the optimizer has used every training example once. Why do we need several epochs? Because gradient descent is an iterative algorithm: it improves, but it only gets there in tiny steps. It can only take tiny steps because it only uses local information; it has no idea what the function looks like beyond the current point.
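In symbols, every one of those tiny steps uses only the gradient at the current point $w_k$ (with step size $\eta$):

$$w_{k+1} = w_k - \eta\,\nabla E(w_k)$$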
You might want to read the gradient descent part of my optimization basics blog post.
How should I approach a situation where I try to apply some ML algorithm (classification; to be more specific, an SVM) to some high-dimensional input, and the results I get are not quite satisfactory?
1-, 2- or 3-dimensional data can be visualized, along with the algorithm's results, so you can get the hang of what's going on and have some idea of how to approach the problem. Once the data is over 3 dimensions, other than intuitively playing around with the parameters, I am not really sure how to attack it.
What do you do to the data? My answer: nothing. SVMs are designed to handle high-dimensional data. I'm working on a research problem right now that involves supervised classification using SVMs. Along with finding sources on the Internet, I ran my own experiments on the impact of dimensionality reduction prior to classification. Preprocessing the features using PCA/LDA did not significantly increase the classification accuracy of the SVM.
To me, this totally makes sense from the way SVMs work. Let x be an m-dimensional feature vector. Let y = Ax where y is in R^n and x is in R^m for n < m, i.e., y is x projected onto a space of lower dimension. If the classes Y1 and Y2 are linearly separable in R^n, then the corresponding classes X1 and X2 are linearly separable in R^m. Therefore, the original subspaces should be "at least" as separable as their projections onto lower dimensions, i.e., PCA should not help, in theory.
Here is one discussion that debates the use of PCA before SVM: link
What you can do is change your SVM parameters. For example, with libsvm link, the parameters C and gamma are crucially important to classification success. The libsvm FAQ, particularly this entry link, contains more helpful tips. Among them (a short sketch putting several of these together follows the list):
Scale your features before classification.
Try to obtain balanced classes. If impossible, then penalize one class more than the other. See more references on SVM imbalance.
Check the SVM parameters. Try many combinations to arrive at the best one.
Use the RBF kernel first. It almost always works best (computationally speaking).
Almost forgot... before testing, cross validate!
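Here is a minimal sketch putting several of those tips together (scikit-learn's SVC instead of raw libsvm, and the grid values are just placeholders I chose, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=100, random_state=0)  # stand-in data

# scale the features, use the RBF kernel, penalize the rarer class more,
# and cross-validate over a grid of C and gamma
pipe = make_pipeline(StandardScaler(),
                     SVC(kernel="rbf", class_weight="balanced"))
grid = GridSearchCV(pipe,
                    {"svc__C": [0.1, 1, 10, 100],
                     "svc__gamma": [1e-3, 1e-2, 1e-1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```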
EDIT: Let me just add this "data point." I recently did another large-scale experiment using the SVM with PCA preprocessing on four exclusive data sets. PCA did not improve the classification results for any choice of reduced dimensionality. The original data with simple diagonal scaling (for each feature, subtract mean and divide by standard deviation) performed better. I'm not making any broad conclusion -- just sharing this one experiment. Maybe on different data, PCA can help.
Some suggestions:
Project data (just for visualization) to a lower-dimensional space (using PCA or MDS or whatever makes sense for your data); a quick sketch of this follows the list.
Try to understand why learning fails. Do you think it overfits? Do you think you have enough data? Is it possible there isn't enough information in your features to solve the task you are trying to solve? There are ways to answer each of these questions without visualizing the data.
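Picking up the first suggestion, a minimal sketch of the visualization step (scikit-learn and matplotlib assumed, with stand-in data):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=50, random_state=0)  # stand-in data

# project to 2D purely for plotting; the classifier itself still sees all features
X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
plt.scatter(X2[:, 0], X2[:, 1], c=y, s=10)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```

If the classes already look hopelessly mixed in such a projection, that is a hint (not proof) that the features may not carry enough information.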
Also, if you tell us what the task is and what your SVM output is, there may be more specific suggestions people could make.
You can try reducing the dimensionality of the problem with PCA or a similar technique. Beware that PCA has two important caveats: (1) it assumes that the data it is applied to is normally distributed, and (2) the resulting data loses its natural meaning (resulting in a black box). If you can live with that, try it.
Another option is to try several parameter selection algorithms. Since SVMs were already mentioned here, you might try the approach of Chang and Lin (Feature Ranking Using Linear SVM), in which they used a linear SVM to pre-select "interesting" features and then used an RBF-based SVM on the selected features. If you are familiar with Orange, a Python data mining library, you will be able to code this method in less than an hour.

Note that this is a greedy approach which, due to its "greediness", might fail in cases where the input variables are highly correlated. In that case, and if you cannot solve the problem with PCA (see above), you might want to turn to heuristic methods that try to select the best possible combinations of predictors. The main pitfall of this kind of approach is its high potential for overfitting. Make sure you keep a batch of "virgin" data that is not seen during the entire process of model building. Test your model on that data only once, after you are sure the model is ready. If it fails, don't use this data again to validate another model; you will have to find a new data set. Otherwise you won't be sure that you haven't overfit once more.
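As a rough illustration of that two-stage idea (the paper itself uses Orange; the scikit-learn calls, feature count, and parameters below are my own assumptions, not the authors' code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# stand-in data: 200 features, only a handful actually informative
X, y = make_classification(n_samples=400, n_features=200, n_informative=15,
                           random_state=0)

# stage 1: rank features by the magnitude of the linear SVM weights
lin = LinearSVC(C=1.0, dual=False, max_iter=5000).fit(X, y)
top = np.argsort(-np.abs(lin.coef_.ravel()))[:20]   # keep 20 "interesting" features

# stage 2: train an RBF SVM on the selected features only
rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X[:, top], y)
```

Being greedy, the ranking inherits the caveat above: strongly correlated inputs can split the weight between them and both end up discarded.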
Oh, and one more thing about SVMs: an SVM is a black box. You are better off figuring out the mechanism that generates the data and modeling the mechanism, not the data. On the other hand, if that were possible, most probably you wouldn't be here asking this question (and I wouldn't be so bitter about overfitting).
List of selected papers on parameter selection:
Feature selection for high-dimensional genomic microarray data
Wrappers for feature subset selection
Parameter selection in particle swarm optimization
(I worked in the laboratory that developed this stochastic method for determining, in silico, the drug-like character of molecules.)
I would approach the problem as follows:
What do you mean by "the results I get are not quite satisfactory"?
If the classification rate on the training data is unsatisfactory, it implies one of the following:
You have outliers in your training data (data that is misclassified). In this case you can try algorithms such as RANSAC to deal with it.
Your model (an SVM in this case) is not well suited to this problem. This can be diagnosed by trying other models (AdaBoost, etc.) or by adding more parameters to your current model.
The representation of the data is not well suited to your classification task. In this case, preprocessing the data with feature selection or dimensionality reduction techniques would help.
If the classification rate on the test data is unsatisfactory, it implies that your model overfits the data:
Either your model is too complex (too many parameters) and needs to be constrained further,
or you trained it on a training set that is too small and you need more data.
Of course it may be a mixture of the above elements. These are all "blind" ways to attack the problem. To gain more insight into the problem, you may use visualization methods, projecting the data into lower dimensions, or look for models better suited to the problem domain as you understand it (for example, if you know the data is normally distributed you can use GMMs to model it, ...).
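A quick way to tell which of the two situations above you are in (a sketch assuming scikit-learn; the data is a stand-in) is to compare the two rates directly:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=80, random_state=0)  # stand-in data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)
print("train accuracy:", clf.score(X_tr, y_tr))  # low -> model/representation problem
print("test accuracy: ", clf.score(X_te, y_te))  # far below train -> overfitting
```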
If I'm not wrong, you are trying to see which parameters to the SVM give you the best result. Your problem is model/curve fitting.
I worked on a similar problem a couple of years ago. There are tons of libraries and algorithms to do the same. I used the Newton-Raphson algorithm and a variation of a genetic algorithm to fit the curve.
Generate/guess/get the result you are hoping for through a real-world experiment (or, if you are doing simple classification, just do it yourself). Compare this with the output of your SVM. The algorithms I mentioned earlier reiterate this process until the result of your model (the SVM in this case) somewhat matches the expected values (note that this process can take some time depending on your problem/data size; it took about 2 months for me on a 140-node Beowulf cluster).
If you choose to go with Newton-Raphson, this might be a good place to start.
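For what it's worth, a bare-bones sketch of the Newton-Raphson idea for tuning a single smooth objective (a toy function of my own choosing, not the SVM setup from the answer):

```python
def newton_minimize(f_prime, f_double_prime, x0, steps=20, tol=1e-8):
    """Newton-Raphson for 1-D minimization: find x where f'(x) = 0."""
    x = x0
    for _ in range(steps):
        step = f_prime(x) / f_double_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# toy usage: minimize f(x) = (x - 3)^2 + 1, so f'(x) = 2(x - 3) and f''(x) = 2
print(newton_minimize(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0))  # -> 3.0
```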
I'm starting neural networks, currently following mostly D. Kriesel's tutorial. Right at the beginning it introduces at least three (different?) learning rules (Hebbian learning, the delta rule, backpropagation) for supervised learning.
I might be missing something, but if the goal is merely to minimize the error, why not just apply gradient descent over Error(entire_set_of_weights)?
Edit: I must admit the answers still confuse me. It would be helpful if one could point out the actual difference between those methods, and the difference between them and straight gradient descent.
To stress it, these learning rules seem to take the layered structure of the network into account. On the other hand, finding the minimum of Error(W) for the entire set of weights completely ignores it. How does that fit in?
One question is how to apportion the "blame" for an error. The classic delta rule or LMS rule is essentially gradient descent. When you apply the delta rule to a multilayer network, you get backprop. Other rules have been created for various reasons, including the desire for faster convergence, unsupervised learning, temporal questions, models believed to be closer to biology, etc.
On your specific question of "why not just gradient descent?": gradient descent may work for some problems, but many problems have local minima that naive gradient descent will get stuck in. The initial response to that is to add a "momentum" term, so that you might "roll out" of a local minimum; that's pretty much the classic backprop algorithm.
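In update-rule form, that momentum idea is just an extra velocity term carried over between steps (notation mine, with momentum coefficient $\mu$ and learning rate $\eta$):

$$v_{t+1} = \mu\,v_t - \eta\,\nabla E(w_t), \qquad w_{t+1} = w_t + v_{t+1}$$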
First off, note that "backpropagation" simply means that you apply the delta rule on each layer from output back to input so it's not a separate rule.
As for why not simple gradient descent: well, the delta rule basically is gradient descent. However, it tends to overfit the training data and doesn't generalize as well as techniques that don't try to decay the error margin to zero. This makes sense because "error" here simply means the difference between our samples and the output; the samples are not guaranteed to accurately represent all possible inputs.
Backpropagation and naive gradient descent also differ in computational efficiency. Backprop basically takes the network's structure into account and, for each weight, calculates only the parts that are actually needed.
The derivative of the error with respect to the weights is split via the chain rule into ∂E/∂W = ∂E/∂A * ∂A/∂W, where A denotes the activations of particular units. In most cases, many of these derivatives will be zero because W is sparse due to the network's topology. With backprop, you get the learning rules for ignoring those parts of the gradient.
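A bare-bones NumPy sketch of that bookkeeping for a two-layer network (the sizes, tanh activation, and squared-error loss are my assumptions): the chain rule is applied layer by layer, and each weight only ever sees the part of the gradient that flows through it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))                 # a batch of inputs
Y = rng.normal(size=(32, 1))                 # targets
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 1))

for _ in range(100):
    # forward pass: keep the activations A that the chain rule will need
    A1 = np.tanh(X @ W1)
    out = A1 @ W2
    err = out - Y                            # dE/d(out) for squared error

    # backward pass: dE/dW = dE/dA * dA/dW, one layer at a time
    dW2 = A1.T @ err
    dA1 = (err @ W2.T) * (1 - A1 ** 2)       # back through the tanh
    dW1 = X.T @ dA1

    W1 -= 0.01 * dW1 / len(X)
    W2 -= 0.01 * dW2 / len(X)
```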
So, from a mathematical perspective, backprop is not that exciting.
There may be problems that, for example, make backprop run into local minima. Furthermore, just as an example, you can't adjust the topology with backprop. There are also cool learning methods using nature-inspired metaheuristics (for instance, evolutionary strategies) that allow adjusting weights AND topology (even recurrent ones) simultaneously. Probably I will add one or more chapters to cover them, too.
There is also a discussion function right on the download page of the manuscript; if you find other hassles that you don't like about the manuscript, feel free to add them to the page so I can change things in the next edition.
Greetz,
David (Kriesel ;-) )