Edge Effect for Linear Regression - time-series

I am wondering if there is a accepted approach to deal with the problem I am having.
I have variables t,dt where t is time and dt is the time between t and some future event. I want to use linear regression to measure the local slope at the large end of t.
The problem is that at the extreme end of t there are only small values for dt because that future event is still in the future. This makes the slope smaller than it should be or even the wrong sign.
Edit: I "fixed" this by using t+dt as my time variable and that seems to work in some tests I ran, but I was wondering about a standard approach.
Blue is example data and red is the "true" linear fit while black is the linear fit found looking at t in (15,20)

Related

which power of the feature should i train with? regression

I have a explanatory variable x and a response variable y. I am trying to find which power of the feature i should train with. You can ignore the colors for my question. the scatter data is from the sensor and the line plot is the theoretical curve from the lab, which you can also ignore for my question.
For this answer I understand you want to obtain some polynomial curve going through the croissant shaped zone where points are dense.
Also I assume that the independent variable is on the horizontal axis, while the dependent is on the vertical one. Otherwise as you can see from the blue line, there is no functional that could give you this.
Now to select the degree of polynomial you can use stepwise regression.
This is about running the regression with more or less features one at a time (i.e decrease or increase the degree of polynomial in this case), and calculating a score such as AIC, BIC, or even adjusted R2 to assess if it's worth it or not to add or remove this feature.

How to improve Levenberg-Marquardt's method for polynomial curve fitting?

Some weeks ago I started coding the Levenberg-Marquardt algorithm from scratch in Matlab. I'm interested in the polynomial fitting of the data but I haven't been able to achieve the level of accuracy I would like. I'm using a fifth order polynomial after I tried other polynomials and it seemed to be the best option. The algorithm always converges to the same function minimization no matter what improvements I try to implement. So far, I have unsuccessfully added the following features:
Geodesic acceleration term as a second order correction
Delayed gratification for updating the damping parameter
Gain factor to get closer to the Gauss-Newton direction or
the steepest descent direction depending on the iteration.
Central differences and forward differences for the finite difference method
I don't have experience in nonlinear least squares, so I don't know if there is a way to minimize the residual even more or if there isn't more room for improvement with this method. I attach below an image of the behavior of the polynomial for the last iterations. If I run the code for more iterations, the curve ends up not changing from iteration to iteration. As it is observed, there is a good fit from time = 0 to time = 12. But I'm not able to fix the behavior of the function from time = 12 to time = 20. Any help will be very appreciated.
Fitting a polynomial does not seem to be the best idea. Your data set looks like an exponential transient, with an horizontal asymptote. Forcing a polynomial to that will work very poorly.
I'd rather try with a simple model, such as
A (1 - e^(-at)).
With naked eye, A ~ 15. You should have a look at the values of log(15 - y).

DONUT- Anomaly detection Algorithm ignores the relationship between sliding windows?

I'm trying to understand the paper : https://netman.aiops.org/wp-content/uploads/2018/05/PID5338621.pdf about Robust and Rapid Clustering of KPIs for Large-Scale Anomaly Detection.
Clustering is done using ROCKA algorithm.
Steps:
1.) Preprocessing is conducted on the raw KPI data to remove amplitude differences and standardize data.
2.) In baseline extraction step, we reduce noises, remove the extreme values (which are likely anomalies), and extract underlying shapes, referred to as baselines, of KPIs. It's done by applying moving average with a small sliding window.
3.) Clustering is then conducted on the baselines of sampled KPIs, with robustness against phase shifts and noises.
4.) Finally, we calculate the centroid of each cluster, then assign the unlabeled KPIs by their distances to these centroids.
I understand ROCKA mechanism.
Now, i'm trying to understand DONUT algorithm which is applied for "Anomaly Detection".
How it works is :
DONUT applies sliding windows over the KPI to get short series x and tries to recognize what normal patterns x follows. The indicator is then calculated by the difference between reconstructed normal patterns and x to show the severity of anomalies. In practice, a threshold should be selected for each KPI. A data point with an indicator value larger than the threshold is regarded as an anomaly.
Now my question is :
IT seems like DONUT is not robust enough against time information related anomalies. Meaning that it works on a set of sliding windows and it ignores the relationship between windows. So the window becomes a very critical parameter here. So it might generate high false positives. What I'm understanding wrong here?
Please help and make me understand how DONUT will capture the relationship between sliding windows.

Regarding to backward of convolution layer in Deep learning

I understood the way to compute the forward part in Deep learning. Now, I want to understand the backward part. Let's take X(2,2) as an example. The backward at the position X(2,2) can compute as the figure bellow
My question is that where is dE/dY (such as dE/dY(1,1),dE/dY(1,2)...) in the formula? How to compute it at the first iteration?
SHORT ANSWER
Those terms are in the final expansion at the bottom of the slide; they contribute to the summation for dE/dX(2,2). In your first back-propagation, you start at the end and work backwards (hence the name) -- and the Y values are the ground-truth labels. So much for computing them. :-)
LONG ANSWER
I'll keep this in more abstract, natural-language terms. I'm hopeful that the alternate explanation will help you see the large picture as well as sorting out the math.
You start the training with assigned weights that may or may not be at all related to the ground truth (labels). You move blindly forward, making predictions at each layer based on naive faith in those weights. The Y(i,j) values are the resulting meta-pixels from that faith.
Then you hit the labels at the end. You work backward, adjusting each weight. Note that, at the last layer, the Y values are the ground-truth labels.
At each layer, you mathematically deal with two factors:
How far off was this prediction?
How heavily did this parameter contribute to that prediction?
You adjust the X-to-Y weight by "off * weight * learning_rate".
When you complete that for layer N, you back up to layer N-1 and repeat.
PROGRESSION
Whether you initialize your weights with fixed or random values (I generally recommend the latter), you'll notice that there's really not much progress in the early iterations. Since this is slow adjustment from guess-work weights, it takes several iterations to get a glimmer of useful learning into the last layers. The first layers are still cluelessly thrashing at this point. The loss function will bounce around close to its initial values for a while. For instance, with GoogLeNet's image recognition, this flailing lasts for about 30 epochs.
Then, finally, you get some valid learning in the latter layers, the patterns stabilize enough that some consistency percolates back to the early layers. At this point, you'll see the loss function drop to a "directed experimentation" level. From there, the progression depends a lot on the paradigm and texture of the problem: some have a sharp drop, then a gradual convergence; others have a more gradual drop, almost an exponential decay to convergence; more complex topologies have additional sharp drops as middle or early phases "get their footing".

Convergence and regularization in linear regression classifier

I am trying to implement a binary classifier using logistic regression for data drawn from 2 point sets (classes y (-1, 1)). As seen below, we can use the parameter a to prevent overfitting.
Now I am not sure, how to choose the "good" value for a.
Another thing I am not sure about is how to choose a "good" convergence criterion for this sort of problem.
Value of 'a'
Choosing "good" things is a sort of meta-regression: pick any value for a that seems reasonable. Run the regression. Try again with a values larger and smaller by a factor of 3. If either works better than the original, try another factor of 3 in that direction -- but round it from 9x to 10x for readability.
You get the idea ... play with it until you get in the right range. Unless you're really trying to optimize the result, you probably won't need to narrow it down much closer than that factor of 3.
Data Set Partition
ML folks have spent a lot of words analysing the best split. The optimal split depends very much on your data space. As a global heuristic, use half or a bit more for training; of the rest, no more than half should be used for testing, the rest for validation. For instance, 50:20:30 is a viable approximation for train:test:validate.
Again, you get to play with this somewhat ... except that any true test of the error rate would be entirely new data.
Convergence
This depends very much on the characteristics of your empirical error space near the best solution, as well as near local regions of low gradient.
The first consideration is to choose an error function that is likely to be convex and have no flattish regions. The second is to get some feeling for the magnitude of the gradient in the region of a desired solution (normalizing your data will help with this); use this to help choose the convergence radius; you might want to play with that 3x scaling here, too. The final one is to play with the learning rate, so that it's scaled to the normalized data.
Does any of this help?

Resources