I am pretty new to anomaly detection on time series, so my question may be obvious to some of you.
Today, I am using LSTM and clustering techniques to detect anomalies in time series, but those methods cannot identify anomalies that get worse slowly over time (I think this is called trending), e.g. the temperature of a machine increasing slowly over a month (the LSTM will learn this trend and predict the increase without any notable error).
Is there a method to detect this kind of fault?
With time series that is usually what you want: learning gradual change, detecting abrupt change. Otherwise, time plays little role.
You can try e.g. the SigniTrend model with a very slow learning rate (a long half-life time, or whatever they called it). Ignore all the tokens, hashing and scalability in that paper; take only the EWMA+EWMVar part, which I really like, and use it on your time series.
If you set the learning rate really low, the thresholds should move slowly enough that your "gradual" change may still be able to trigger them.
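A minimal sketch of that idea, assuming a plain exponentially weighted mean and variance rather than the exact SigniTrend formulation. The function name, the half-life of 500 steps, the z-threshold of 3 and the warm-up on the first 50 points are illustrative choices, not values from the paper, and the data is synthetic.

```python
import math

def ewma_anomalies(values, half_life=500.0, z_threshold=3.0, warmup=50):
    """Flag points that deviate strongly from a slowly adapting mean."""
    # Learning rate such that a point's weight halves after `half_life` steps.
    alpha = 1.0 - 0.5 ** (1.0 / half_life)
    # Assumption: the first `warmup` points are normal and seed the statistics.
    head = values[:warmup]
    mean = sum(head) / len(head)
    var = sum((v - mean) ** 2 for v in head) / len(head)
    flags = [False] * len(head)
    for x in values[warmup:]:
        std = math.sqrt(var)
        # Score against the statistics learned before seeing x.
        flags.append(std > 0 and abs(x - mean) > z_threshold * std)
        # Incremental EWMA / EWMVar update.
        delta = x - mean
        mean += alpha * delta
        var = (1.0 - alpha) * (var + alpha * delta * delta)
    return flags

# Synthetic data: a stable phase, then a slow upward drift of 0.05 per step.
stable = [50.0 + math.sin(i) for i in range(300)]
ramp = [50.0 + 0.05 * j + math.sin(300 + j) for j in range(400)]
flags = ewma_anomalies(stable + ramp)
first_alarm = flags.index(True) if True in flags else None
```

Because the mean adapts much more slowly than the drift, the drift eventually crosses the threshold. With a short half-life the mean would follow the drift and nothing would fire.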
Or you ignore time completely. Split your data into a training set (that must not contain anomalies) and learn the mean and variance on that to find thresholds. Then classify any point outside these thresholds as abnormal (e.g. temperature > mean + 3 * standard deviation).
As this super naive approach does not learn, it will not follow a drift either. But then time does not play any further role.
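The static variant can be sketched as follows. The temperature readings are made up for illustration and the helper names are hypothetical.

```python
import statistics

def fit_threshold(training_values, k=3.0):
    """Upper threshold mean + k * std, learned once on anomaly-free data."""
    mean = statistics.mean(training_values)
    std = statistics.stdev(training_values)
    return mean + k * std

def flag_abnormal(values, threshold):
    """The threshold is fixed, so it never follows a drift."""
    return [v > threshold for v in values]

# Made-up training readings that are assumed to contain no anomalies.
training = [70.1, 69.8, 70.4, 70.0, 69.7, 70.2, 70.3, 69.9]
threshold = fit_threshold(training)
flags = flag_abnormal([70.2, 70.6, 72.5], threshold)
```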
The big picture of the problem is the following: predicting the failure of an engine by predicting its temperature, because that is intuitively the main cause of failure. The first thing I want to do is to check which other variables influence the temperature, like the torque of the engine, its operating mode, etc. Why? Because those are variables we can change in real life and thus avoid high temperatures, and thus the failure.
So, my question is how to find which variables the temperature depends on, and how strongly. As the temperature depends on time, this is a time series problem, but not all the variables depend on their past values. Therefore, I am not sure that autoregressive models would work.
The first thing that came to mind was to check whether there is a linear relationship. But thinking about how a temperature physically evolves, I'm pretty sure it does so exponentially, so perhaps just by taking the natural logarithm we can transform it into a linear problem and then apply a linear regression. The problem is that this won't capture the time dependency of the temperatures. I looked into autoregressive models but I'm not sure they will work. All I want for now is to see which variables have an impact on temperature, not to predict the temperature.
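The log-transform idea can be sketched as below, assuming a single explanatory variable and an exactly exponential relationship. The torque values, the constants 20 and 0.01, and the function name are invented for illustration. As noted above, this ignores the time dependency entirely.

```python
import math

def fit_log_linear(x, y):
    """Least-squares fit of ln(y) = a + b * x; returns (a, b)."""
    log_y = [math.log(v) for v in y]
    n = len(x)
    mean_x = sum(x) / n
    mean_ly = sum(log_y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (li - mean_ly) for xi, li in zip(x, log_y))
    b = sxy / sxx
    a = mean_ly - b * mean_x
    return a, b

# Made-up data following temperature = 20 * exp(0.01 * torque).
torque = [100.0, 150.0, 200.0, 250.0, 300.0]
temperature = [20.0 * math.exp(0.01 * t) for t in torque]
a, b = fit_log_linear(torque, temperature)
```

The slope b is the estimated relative change in temperature per unit of torque, which gives a rough measure of how strongly the variable matters under this model.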
I'm working on an idea for my Master's thesis topic.
I have a dataset with millions of records describing on-street parking sensors.
Data I have:
- vehicle present on a particular sensor (true or false)
It is normal for there to be several consecutive parking events with False values and different durations.
- arrival time and departure time (month, day, hour, minute and even second)
- duration in minutes
There are a few more columns, but I have no idea how to represent the "continuity of time" in my analysis and
reflect it in the calculations for a certain future time, based on the times when the parking space was usually free or occupied.
Any ideas?
You can take two approaches:
If you want to predict whether a particular space will be occupied or not, and you take into account the order of the events (time), this looks like a time series problem. You should start by trying simple time series methods like moving averages or ARIMA models. There are more sophisticated methods that take long-term and short-term relationships into account, like recurrent neural networks, especially LSTM (long short-term memory), which has shown good performance on time series problems.
Alternatively, you can take all the variables into account and use them to train a clustering algorithm like K-means, or a classifier like SVM.
As you pointed out:
And few more columns, but I don't have any idea how to show in my analysis that "continuity of time" and reflect this in the calculations for a certain future time based on the time when the parking space was usually free or occupied.
I recommend treating this as a time series problem.
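One way to get from arrival/departure records to a time series is to sample occupancy at regular instants and then summarise it per weekday and hour. The sketch below assumes events are given as (arrival, departure) pairs for a single sensor; the function names, the 30-minute step and the two example events are illustrative only.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def to_regular_series(events, start, end, step=timedelta(minutes=30)):
    """Sample occupancy (1 or 0) at regular instants between start and end."""
    series = []
    t = start
    while t < end:
        # Occupied if any parking event covers instant t (departure exclusive).
        # This linear scan is fine for a sketch but too slow for millions of rows.
        occupied = any(arrival <= t < departure for arrival, departure in events)
        series.append((t, int(occupied)))
        t += step
    return series

def occupancy_by_weekday_hour(series):
    """Historical occupancy rate for each (weekday, hour) slot; Monday is 0."""
    totals = defaultdict(lambda: [0, 0])
    for t, occupied in series:
        slot = totals[(t.weekday(), t.hour)]
        slot[0] += occupied
        slot[1] += 1
    return {key: occ / n for key, (occ, n) in totals.items()}

# Two made-up parking events on consecutive Mondays.
events = [
    (datetime(2018, 4, 23, 9, 0), datetime(2018, 4, 23, 10, 0)),
    (datetime(2018, 4, 30, 9, 0), datetime(2018, 4, 30, 9, 30)),
]
series = to_regular_series(events, datetime(2018, 4, 23), datetime(2018, 5, 7))
profile = occupancy_by_weekday_hour(series)
```

The regular series can be fed to a moving average, ARIMA or LSTM model, and the per-slot profile is a simple seasonal baseline to compare them against.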
Time series modelling will be the better option for this kind of problem. As you said, you want to predict a binary output at different time intervals, i.e. whether the parking slot will be occupied during a particular time interval or not. You can use an LSTM for this purpose.
Time series is definitely an option here. If you are really going with LSTMs, why not look into Transformers and take advantage of the attention mechanism for time series forecasting? I don't know them thoroughly yet; I just have a vague idea of them and of their performance benefits over RNNs and LSTMs.
I am working on a problem that we aim to solve with deep Q-learning. However, the problem is that training just takes too long for each episode, roughly 83 hours. We envision solving the problem within, say, 100 episodes.
So we are gradually learning a matrix (100 * 10), and within each episode, we need to perform 100*10 iterations of certain operations. Basically we select a candidate from a pool of 1000 candidates, put this candidate in the matrix, and compute a reward function by feeding the whole matrix as the input:
The central hurdle is that the reward function computation at each step is costly, roughly 2 minutes, and each time we update one entry in the matrix.
All the elements in the matrix depend on each other in the long term, so the whole procedure seems not suitable for some "distributed" system, if I understood correctly.
Could anyone shed some light on potential optimization opportunities here? Some extra engineering effort, for instance? Any suggestions and comments would be very much appreciated. Thanks.
======================= update of some definitions =================
0. initial stage:
a 100 * 10 matrix, with every element as empty
1. action space:
each step I will select one element from a candidate pool of 1000 elements. Then insert the element into the matrix one by one.
2. environment:
each step I will have an updated matrix to learn.
An oracle function F returns a quantitative value ranging from 5000 to 30000, the higher the better (one computation of F takes roughly 120 seconds).
This function F takes the matrix as input and performs a very costly computation, and it returns a quantitative value indicating the quality of the matrix synthesized so far.
This function is essentially used to measure some performance of the system, so it does take a while to compute a reward value at each step.
3. episode:
By saying "we are envisioning to solve it within 100 episodes", I mean that this is just an empirical estimate. But it should not be fewer than 100 episodes, at least.
4. constraints
Ideally, like I mentioned, "All the elements in the matrix depend on each other in the long term", and that's why the reward function F computes the reward by taking the whole matrix as the input rather than the latest selected element.
Indeed by appending more and more elements in the matrix, the reward could increase, or it could decrease as well.
5. goal
The synthesized matrix should make the oracle function F return a value greater than 25000. Whenever it reaches this goal, I will terminate the learning step.
Honestly, there is no effective way to know how to optimize this system without knowing specifics such as which computations are in the reward function or which programming design decisions you have made that we can help with.
You are probably right that the episodes are not suitable for distributed calculation, meaning we cannot parallelize this, as they depend on previous search steps. However, it might be possible to throw more computing power at the reward function evaluation, reducing the total time required to run.
I would encourage you to share more details on the problem, for example by profiling the code to see which component takes up the most time, by sharing a code excerpt or, as the standard for doing science gets higher, by sharing a reproducible code base.
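Profiling could look like the sketch below, which uses Python's built-in cProfile. Here oracle_reward and run_steps are hypothetical stand-ins for the real reward function F and the training loop, and the sleep only simulates an expensive computation.

```python
import cProfile
import io
import pstats
import time

def oracle_reward(matrix):
    """Hypothetical stand-in for the costly function F."""
    time.sleep(0.01)  # simulate an expensive computation
    return sum(sum(row) for row in matrix)

def run_steps(n_steps):
    """Fill a 100 x 10 matrix one entry per step and re-evaluate the reward."""
    matrix = [[0] * 10 for _ in range(100)]
    total = 0
    for step in range(n_steps):
        matrix[step // 10][step % 10] = 1
        total = oracle_reward(matrix)
    return total

profiler = cProfile.Profile()
profiler.enable()
result = run_steps(5)
profiler.disable()

# Collect the five most expensive calls by cumulative time.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

The report shows where the time goes, which tells you whether speeding up the reward function or something else in the loop is worth the effort.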
Not a solution to your question, just some general thoughts that maybe are relevant:
One of the biggest obstacles to applying Reinforcement Learning to "real world" problems is the astoundingly large amount of data/experience required to achieve acceptable results. For example, OpenAI's Dota 2 agent collected experience equivalent to 900 years of play per day. In the original Deep Q-network paper, achieving performance close to that of a typical human required hundreds of millions of game frames, depending on the specific game. In other benchmarks where the inputs are not raw pixels, such as MuJoCo, the situation isn't a lot better. So, if you don't have a simulator that can generate samples (state, action, next state, reward) cheaply, maybe RL is not a good choice. On the other hand, if you have a ground-truth model, maybe other approaches can easily outperform RL, such as Monte Carlo Tree Search (e.g., Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning or Simple random search provides a competitive approach to reinforcement learning). All these ideas and much more are discussed in this great blog post.
The previous point is especially true for deep RL. Approximating value functions or policies with a deep neural network that has millions of parameters usually implies that you'll need a huge quantity of data, or experience.
And regarding to your specific question:
In the comments, I've asked a few questions about the specific features of your problem. I was trying to figure out whether you really need RL to solve the problem, since it's not the easiest technique to apply. On the other hand, if you really need RL, it's not clear whether you should use a deep neural network as the approximator or whether you can use a shallow model (e.g., random trees). However, these questions and other potential optimizations require more domain knowledge. Here, it seems you are not able to share the domain of the problem, which could be for numerous reasons, and I perfectly understand.
You have estimated the number of episodes required to solve the problem based on some empirical studies using a smaller version, a matrix of size 20*10. Just a note of caution: due to the curse of dimensionality, the complexity of the problem (or the experience needed) could grow exponentially as the dimensionality of the state space grows, although maybe that is not your case.
That said, I'm looking forward to seeing an answer that really helps you solve your problem.
I'm new to machine learning, and I understand that there are parameters and choices that apply to the model you attach to a certain set of inputs, which can be tuned/optimised, but those inputs obviously tie back to fields you generated by slicing and dicing whatever source data you had in a way that makes sense to you. But what if the way you decided to model and cut up your source data, and therefore training data, isn't optimal? Are there ways or tools that extend the power of machine learning into, not only the model, but the way training data was created in the first place?
Say you're analysing the accelerometer, GPS, heart rate and surrounding topography data of someone moving. You want to try to determine where this person is likely to become exhausted and stop, assuming they'll continue moving in a straight line based on their trajectory, and that going up any hill will increase their heart rate to some point where they must stop. Whether they're running or walking obviously modifies these things.
So you cut up your data, and feel free to correct how you'd do this, but it's less relevant to the main question:
Slice up raw accelerometer data along the X, Y and Z axes for the past A number of seconds into B number of slices to try and profile it, probably applying a CNN to it, to determine whether the person is running or walking
Cut up the recent C seconds of raw GPS data into a sequence of D (Lat, Long) pairs, each pair representing the average of E seconds of raw data
Based on the previous sequence, determine speed and trajectory, and determine the upcoming slope, by slicing the next F distance (or seconds, another option to determine, of G) into H number of slices, profiling each, etc...
You get the idea. How do you effectively determine A through H, some of which would completely change the number and behaviour of model inputs? I want to take out any bias I may have about what's right, and let it determine end-to-end. Are there practical solutions to this? Each time it changes the parameters of data creation, go back, re-generate the training data, feed it into the model, train it, tune it, over and over again until you get the best result.
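The loop described above amounts to treating the data-creation parameters as hyperparameters. A toy sketch of it follows, assuming an exhaustive grid over two such parameters; the recordings, the feature builder and the 1-nearest-neighbour scorer are all invented for illustration and stand in for the real pipeline.

```python
import itertools
import math

def build_features(recordings, window, n_slices):
    """Average the last `window` samples of each recording in `n_slices` chunks.

    Assumes `window` is divisible by `n_slices`.
    """
    size = window // n_slices
    features = []
    for rec in recordings:
        tail = rec[-window:]
        features.append([sum(tail[i * size:(i + 1) * size]) / size
                         for i in range(n_slices)])
    return features

def loo_1nn_accuracy(features, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier (toy model)."""
    correct = 0
    for i, fi in enumerate(features):
        others = [j for j in range(len(features)) if j != i]
        nearest = min(others, key=lambda j: math.dist(fi, features[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(features)

def search_preprocessing(recordings, labels, param_grid):
    """Regenerate features, re-train and re-score for every parameter combination."""
    best = None
    for values in itertools.product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        score = loo_1nn_accuracy(build_features(recordings, **params), labels)
        if best is None or score > best[0]:
            best = (score, params)
    return best

# Made-up recordings: only the last two samples carry the class signal.
recordings = [
    [9, 9, 9, 9, 9, 9, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [9, 9, 9, 9, 9, 9, 5, 5],
    [0, 0, 0, 0, 0, 0, 5, 5],
]
labels = [0, 0, 1, 1]
best_score, best_params = search_preprocessing(
    recordings, labels, {"window": [2, 4, 8], "n_slices": [1, 2]}
)
```

In practice the score should come from data held out from the search itself, otherwise the chosen preprocessing can overfit just like model hyperparameters do.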
What you call your bias is actually the greatest strength you have. You can include your knowledge of the system. Machine learning, including glorious deep learning is, to put it bluntly, stupid. Although it can figure out features for you, interpretation of these will be difficult.
Also, machine learning, and especially deep learning, has a great capacity to memorise (not learn!) patterns, making it easy to overfit to the training data. Making machine learning models that generalise well in the real world is tough.
In most successful approaches (check against Kaggle Masters) people create features. In your case I'd probably want to calculate the magnitude and direction of the force. Depending on the type of scenario, I might transform (Lat, Long) into a distance from a specific point (say, the point of origin / activation, or a point re-established every minute) or maybe use a different coordinate system.
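Two such hand-made features can be sketched as follows. The function names are hypothetical, and the distance uses the haversine formula with a mean Earth radius of 6,371 km, which is an approximation.

```python
import math

def acceleration_magnitude(x, y, z):
    """Length of the acceleration vector, independent of device orientation."""
    return math.sqrt(x * x + y * y + z * z)

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

magnitude = acceleration_magnitude(3.0, 4.0, 12.0)
# Distance from a reference point (here the origin) to the current position.
distance = haversine_m(0.0, 0.0, 0.0, 1.0)
```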
Since your data is a time series, I'd probably use something well suited to time series modelling that you can understand and troubleshoot. CNNs and the like are typically your last resort in the majority of cases.
If you really would like to automate it, check e.g. Auto Keras or ludwig. When it comes to learning which features matter most, I'd recommend going with gradient boosting (GBDT).
I'd recommend reading this article from Airbnb, which takes a deeper dive into the journey of building such systems and into feature engineering.
I want my Prophet model to predict values for every 10-minute interval over the next 24 hours (i.e. 24*6=144 values).
Let's say I've trained a model on a huge (over 900k rows) .csv file where a sample row is
...
ds=2018-04-24 16:10, y=10
ds=2018-04-24 16:20, y=14
ds=2018-04-24 16:30, y=12
...
So I call model.fit(huge_df) and wait for 1-2 seconds to receive 144 values.
Then an hour passes and I want to tune my prediction for the following (144 - 6) 138 values given new data (6 rows).
How can I tune my existing Prophet model without having to call model.fit(huge_df + live_df) and wait for some seconds again? I'd like to be able to call model.tune(live_df) and get an instant prediction.
As far as I'm aware this is not really a possibility. I think they use a variant of the BFGS optimization algorithm to maximize the posterior probability of the model. So as I see it the only way to train the model is to take into account the whole dataset you want to use. The reason why transfer learning works with neural networks is that it is just a weight (parameter) initialization, and back propagation is then run iteratively in the standard SGD training scheme. Theoretically you could initialize the parameters to those of the previous model in the case of Prophet, which might or might not work as expected. I'm however not aware that something of the sort is currently implemented (but since it's open source you could give it a shot, hopefully reducing convergence times quite a bit).
Now as far as practical advice goes: you probably don't need all the data, so just trim it to what you really need for the problem at hand. For instance, it does not make sense to have 10 years of data if you only have monthly seasonality. Also, depending on how strongly your data is autocorrelated, you may downsample a bit without losing any predictive power. Another idea would be to try an algorithm that is suitable for online (or batch) learning. You could for instance try a CNN with dilated convolutions.
Time series problems are quite different from the usual machine learning problems. When we are training a cat/dog classifier, the features of cats and dogs are not going to change instantly (evolution is slow). But when it comes to time series problems, training should happen every time prior to forecasting. This becomes even more important when you are doing univariate forecasting (as in your case), as the only features we provide to the model are the past values, and these values change at every instant. Because of these concerns, I don't think something like transfer learning will work for time series.
Instead, you can try converting your time series problem into a regression problem by using a rolling window approach. Then you can save that model and get your predictions. But make sure to retrain it at short intervals, like once a day or so, depending on how frequently you need a forecast.
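The rolling window conversion can be sketched as below. The function name and the choice of three lags are illustrative; the resulting (X, y) pairs can be passed to any regression model.

```python
def make_windows(series, n_lags, horizon=1):
    """Turn a univariate series into (lagged inputs, target) pairs for regression."""
    X, y = [], []
    for i in range(n_lags, len(series) - horizon + 1):
        X.append(series[i - n_lags:i])      # the n_lags values before position i
        y.append(series[i + horizon - 1])   # the value `horizon` steps ahead
    return X, y

# Made-up values in the style of the y column above.
series = [10, 14, 12, 15, 13, 16]
X, y = make_windows(series, n_lags=3)
```

To forecast, the most recent n_lags observations form the input row, so new data is used at prediction time even between retraining runs.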