The big picture of the problem is the following : predicting the failure on engine by predicting temperature of the engine because that is intuitively the main reason of failure. The first thing I want to do is to check which other variables influence the temperature, like torque of the engine, functioning mode of the engine etc. Why? Because those are variables we can change in real life and thus avoid high temperatures, and thus the failure.
So, my question is how to find on which variables the temperature is dependent and how much. As the temperature is dependent on the time we're in the case of a time series problem but not all the variables are dependent on their past values. Therefore, I am not sure that auto-regressive models could work.
First thing that came in mind was to check whether there is a linear relationship. But by reflecting on how physically a temperature will evolve I'm pretty sure it is exponentially, so perhaps just taking the natural logarithm we can transform it into a linear problem and then apply a linear regression. The problem is that won't capture the time dependency of the temperatures. I looked into autoregressive models but I'm not sure they will work. All I want is to see which variables have an impact on temperature not to predict the temperature for now.
Related
I am dealing with a repeating pattern in time series data. My goal is to classify every pattern as 1, and anything that does not follow the pattern as 0. The pattern repeats itself between every two peaks as shown below in the image.
The patterns are not necessarily fixed in sample size but stay within approximate sample size, let's say 500samples +-10%. The heights of the peaks can change. The random signal (I called it random, but basically it means not following pattern shape) can also change in value.
The data is from a sensor. Patterns are when the device is working smoothly. If the device is malfunctioning, then I will not see the patterns and will get something similar to the class 0 I have shown in the image.
What I have done so far is building a logistic regression model. Here are my steps for data preparation:
Grab data between every two consecutive peaks, resample it to a fixed size of 100 samples, scale data to [0-1]. This is class 1.
Repeated step 1 on data between valley and called it class 0.
I generated some noise, and repeated step 1 on chunk of 500 samples to build extra class 0 data.
Bottom figure shows my predictions on the test dataset. Prediction on the noise chunk is not great. I am worried in the real data I may get even more false positives. Any idea on how I can improve my predictions? Any better approach when there is no class 0 data available?
I have seen similar question here. My understanding of Hidden Markov Model is limited but I believe it's used to predict future data. My goal is to classify a sliding window of 500 sample throughout my data.
I have some proposals, that you could try out.
First, I think in this field often recurrent neural networks are used (e.g. LSTMs). But I also heard that some people also work with tree based method like light gbm (I think Aileen Nielsen uses this approach).
So if you don't want to dive into neural networks, which is probably not necessary, because your signals seem to be distinguishable relative easily, you can give light gbm (or other tree ensamble methods) a chance.
If you know the maximum length of a positive sample, you can define the length of your "sliding sample-window" that becomes your input vector (so each sample in the sliding window becomes one input feature), then I would add an extra attribute with the number of samples when the last peak occured (outside/before the sample window). Then you can check in how many steps you let your window slide over the data. This also depends on the memory you have available for this.
But maybe it would be wise then to skip some of the windows between a change between positive and negative, because the states might not be classifiable unambiguously.
In case memory becomes an issue, neural networks could be the better choice, because for training they do not need all training data available at once, so you can generate your input data in batches. With tree based methods this possible does not exist or only in a very limited way.
I'm not sure of what you are trying to achieve.
If you want to characterize what is a peak or not - which is an after the facts classification - then you can use a simple rule to define peaks such as signal(t) - average(signal, t-N to t) > T, with T a certain threshold and N a number of data points to look backwards to.
This would qualify what is a peak (class 1) and what is not (class 0), hence does a classification of patterns.
If your goal is to predict that a peak is going to happen few time units before the peak (on time t), using say data from t-n1 to t-n2 as features, then logistic regression might not necessarily be the best choice.
To find the right model you have to start with visualizing the features you have from t-n1 to t-n2 for every peak(t) and see if there is any pattern you can find. And it can be anything:
was there a peak in in the n3 days before t ?
is there a trend ?
was there an outlier (transform your data into exponential)
in order to compare these patterns, think of normalizing them so that the n2-n1 data points go from 0 to 1 for example.
If you find a pattern visually then you will know what kind of model is likely to work, on which features.
If you don't then it's likely that the white noise you added will be as good. so you might not find a good prediction model.
However, your bottom graph is not so bad; you have only 2 major false positives out of >15 predictions. This hints at better feature engineering.
I'm new to machine learning, and I understand that there are parameters and choices that apply to the model you attach to a certain set of inputs, which can be tuned/optimised, but those inputs obviously tie back to fields you generated by slicing and dicing whatever source data you had in a way that makes sense to you. But what if the way you decided to model and cut up your source data, and therefore training data, isn't optimal? Are there ways or tools that extend the power of machine learning into, not only the model, but the way training data was created in the first place?
Say you're analysing the accelerometer, GPS, heartrate and surrounding topography data of someone moving. You want to try determine where this person is likely to become exhausted and stop, assuming they'll continue moving in a straight line based on their trajectory, and that going up any hill will increase heartrate to some point where they must stop. If they're running or walking modifies these things obviously.
So you cut up your data, and feel free to correct how you'd do this, but it's less relevant to the main question:
Slice up raw accelerometer data along X, Y, Z axis for the past A number of seconds into B number of slices to try and profile it, probably applying a CNN to it, to determine if running or walking
Cut up the recent C seconds of raw GPS data into a sequence of D (Lat, Long) pairs, each pair representing the average of E seconds of raw data
Based on the previous sequence, determine speed and trajectory, and determine the upcoming slope, by slicing the next F distance (or seconds, another option to determine, of G) into H number of slices, profiling each, etc...
You get the idea. How do you effectively determine A through H, some of which would completely change the number and behaviour of model inputs? I want to take out any bias I may have about what's right, and let it determine end-to-end. Are there practical solutions to this? Each time it changes the parameters of data creation, go back, re-generate the training data, feed it into the model, train it, tune it, over and over again until you get the best result.
What you call your bias is actually the greatest strength you have. You can include your knowledge of the system. Machine learning, including glorious deep learning is, to put it bluntly, stupid. Although it can figure out features for you, interpretation of these will be difficult.
Also, especially deep learning, has great capacity to memorise (not learn!) patterns, making it easy to overfit to training data. Making machine learning models that generalise well in real world is tough.
In most successful approaches (check against Master Kagglers) people create features. In your case I'd probably want to calculate magnitude and vector of the force. Depending on the type of scenario, I might transform (Lat, Long) into distance from specific point (say, point of origin / activation, or established every 1 minute) or maybe use different coordinate system.
Since your data in time series, I'd probably use something well suited for time series modelling that you can understand and troubleshoot. CNN and such are typically your last resort in majority of cases.
If you really would like to automate it, check e.g. Auto Keras or ludwig. When it comes to learning which features matter most, I'd recommend going with gradient boosting (GBDT).
I'd recommend reading this article from AirBnB that takes deeper dive into journey of building such systems and feature engineering.
So I have the following problem: I realized (while writing my master thesis) that I am still not sure/have vague descriptions of some of the machine learning principles.
For instance, I vaguely remember that at some point I heard the following description:
The output (label) of a classification task is discrete and finite while the output (label) of a regression task is continuous and can be infinite
The one word that I am unsure of is infinite for regression in this description.
For instance, if you assume that (for whatever reason) you have 2D data points that are almost distributed like a sine wave (with some noise) and you use polyfit to fit a polynomial of k-degree on it (see Figure here here k = 8). Now you have some data in a specific range, e.g., here the range of available points in the x-direction is [0,12], which is used to fit the polynomial.
However wouldn't you be able to quickly get the y-result for the value x = 1M (or an arbitrarily large number), as you have the general shape of the polynomial? Is that not what infinite labels mean?
Maybe I am just wrongly remembering stuff that I learned years ago ;).
best regards
First of all, this is a question more fitting for the more theoretically inclined sites of StackExchange, like Stats Stackexchange Math Stackexchange, or the Data Science Stackexchange, which conveniently also provide answers to your question.
But not quite. In any case, your problem seems to be on the distinction between input and output. The type of task (i.e. either classificaiton or regression) is solely based on the output of your model, but has nothing to do with the input.
You could have a ton of "continuous input variables" (or even a mixture with distinct ones), and still call it a classification task, if it has a distinct amount of output values.
Furthermore, the infinite simply refers to the fact that these values are not bounded, i.e. you cannot restrict your regression task to a specific range easily. If you suddenly input a value completely outside of your training value range (as with your example), you will likely get an "infinite" y value, since your network will only be trained on this specific range; a problem that also happens with polynomial fitting, as the following example shows:
The red line could be the learned function for your network, so if you suddenly go far beyond known values, you likely get some extreme value (unless you train very well).
Opposed to that, the classification network would still predict any of the given classes. I like to imagine it kind of a Voronoi diagram: Even if your point is arbitrarily far from any of the previous points, it will still belong to some category.
I am pretty new on anomaly detection on time sequence so my question can be obvious for some of you.
Today, I am using lstm and clustering techniques to detect anomalies on time sequences but those method can not identify anomalies that get worse slowly over the time (i think it called trending), i.e temprature of machine increase slowly over the month (lstm will learn this trend and predict the increase without any special error).
There is such a method to detect this kind of faluts?
With time series that is usually what you want: learning gradual change, detecting abrupt change. Otherwise, time plays little role.
You can try e.g. the SigniTrend model with a very slow learning rate (a long half-life time or whatever they called it. Ignore all the tokens, hashing and scalability in that paper, only get the EWMA+EWMVar part which I really like and use it on your time series).
If you set the learning rate really low, the threshold should move slow enough so that your "gradual" change may still be able to trigger them.
Or you ignore time completely. Split your data into a training set (that must not contain anomalies), learn mean and variance on that to find thresholds. Then classify any point outside these thresholds as abnormal (I.e. temperature > mean + 3 * standarddeviation).
As this super naive approach does not learn, it will not follow a drift either. But then time does not play any further role.
I am currently working on coding sensor fusion of a wheel based robot pose from GPS, Lidar, Vision and Vehicle measure. Its model is basic kinematics using EKF and no discrimination against sensors i.e. data comes in based on time stamp.
I have difficulty to fuse those sensors due to following issue;
Sometimes when the latest incoming data comes in from different sensor from a sensor gave previous state, the latest pose of the robot comes in behind previous pose. Therefore data fusion does not get so smooth and zigzag-ed as a result.
I would like discard data which plots behind/backwards of the previous data and take data which poses always forward/ahead of previous state even when sensor to provide the data changes between timestamp t and timestamp t+1. Since the data frame is global frame, it is impossible to rely on its x coordinate in minus to achieve this.
Please let me know if you had some idea on this. Thank you so much in advance.
Best,
Preliminary warning
Let me slip here a warning before suggesting posible solutions to your problem: be careful with discarding data based on your current estimate, since you never know if last measure is "pulling pose back" or previous one was wrong and caused your estimate to move forward too much.
Posible solutions
In a Kalman-like filter, observations are assumed to provide independent, uncorrelated information about state vector variables. These observations are assumed to have a random error distributed as a zero mean gaussian variable. Real life is harder, though :-(
Sometimes, measures are affected by a "bias" (a fixed term, similar to the gaussian error having a non-zero mean). e.g. tropospheric perturbations are known to introduce a position error in GPS fixes that drifts slowly over time.
If you take several sensors observing the same variable, as GPS and Lidar for for position, but they have different biases, your estimation will be jumping back and forth. Scaling problems can have a similar effect.
I will assume this is the root of your problem. If not, please refine your question.
How can you mitigate this problem? I see several alternatives:
Introduce a bias/scale correction term in your state vector to compensate sensor bias/drift. This is a very common trick in EKFs for inertial sensor fusion (gyro/accelerometer), that can work nice when tuned properly.
Apply some preprocessing to sensory inputs to correct known problems. It can be difficult to tune a filter for estimating state vector and sensor parameters at the same time.
Change how observations are interpreted. For example, use difference between consecutive position observations so that you are creating a fake odometer sensor. This greatly reduces the drift problem.
Post-process your output. Instead of discarding observations, integrate them and keep the "jumping" state vector internally, but smooth the output vector to eliminate the jumps. This is done in some UAV autopilots because such jumps affect the performance of PID controllers.
Finally, the most obvious and simple approach: discard observations based on some statistical test. A chi-square test of the residual can be used to determine if an observation is too far from expected values and must be discarded. Be careful with this options, though: observation rejection schemes must be completed with a state vector reinitialization logic to resutl in a stable behavior.
Almost all these solutions require knowning the source of each observation, so you would no longer be able to treat them indistinctly.