removing bias/noise from gappy signal - signal-processing

I have an array of soil water content sensors across several desert field sites. Their signals contain a lot of noise or bias (depending on who I talk to). I want to remove the junk while keeping as much of the signal as possible. I'm not a signal processing guy, so anything along the lines of "use an XYZ filter" or a particular algorithm or something would really help me.
I've posted a plot showing a year's worth of data from one probe. The signal is the "top"; all the junk is below the signal:
http://www.unm.edu/~hilton/swc.png
I've played around with lowess smoothing a lot; that works reasonably well except in places where there's a lot of bias below the signal (like roughly idx 1000 to 2000 and 15000 to 16000 in the example plot).
I have access to Matlab's signal processing toolbox and I'm very comfortable in R and python; if there's a pre-packaged filter in one of those I could jump off from that would be great (but I'm open to coding something new).
Many thanks,
Tim

I'd start with a median filter. If I read your plot correctly you're sampling twice an hour and the data isn't too dynamic. Assuming that's correct, a median filter length of 47 or 49 would equate to a one-day window. In this data set you could probably crank that up to a week or more. In any case you should plot the unfiltered and filtered data on top of each other to make sure the filtered data passes the eyeball test (you'll know it when you see it). You may need to do the final clean-up by hand (hope you don't have thousands of sensors).
(Also, I'd send an intern or grad student out to the field sites to find out what's wrong with sensors and fix them.)
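For example, a minimal Python sketch of the median-filter idea, assuming the readings sit in a 1-D NumPy array sampled twice an hour; the synthetic data here is just a stand-in for your probe file:

import numpy as np
from scipy.signal import medfilt
import matplotlib.pyplot as plt

# stand-in for one year of probe data at 2 samples/hour (17520 points); replace with your own array
rng = np.random.default_rng(0)
swc = 0.15 + 0.05 * np.sin(np.linspace(0, 4 * np.pi, 17520)) + rng.normal(0, 0.01, 17520)
swc[1000:2000] -= rng.uniform(0, 0.1, 1000)      # fake junk below the signal

filtered_day = medfilt(swc, kernel_size=49)      # ~1 day (48 samples/day; kernel length must be odd)
filtered_week = medfilt(swc, kernel_size=337)    # ~1 week (7 * 48 + 1)

# eyeball test: overlay raw and filtered
plt.plot(swc, alpha=0.4, label="raw")
plt.plot(filtered_week, label="median filtered (1-week window)")
plt.legend()
plt.show()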

It might be worth a quick try to implement some standard-deviation filtering of your data set. Split your data up into N segments and, for each segment, calculate the standard deviation of the Y-values. Once you've got that, filter out data points whose Y-values exceed 3 standard deviations (or however much you want). Of course, there is some manual work involved in figuring out exactly how many segments to use.
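A rough sketch of that in Python; the segment count and the 3-standard-deviation cutoff are placeholders you would tune, and the toy array stands in for your sensor data:

import numpy as np

def sd_filter(y, n_segments=52, n_sd=3.0):
    """Boolean mask keeping points within n_sd standard deviations of their segment's mean."""
    keep = np.ones(y.size, dtype=bool)
    for seg in np.array_split(np.arange(y.size), n_segments):
        mu, sd = y[seg].mean(), y[seg].std()
        keep[seg] = np.abs(y[seg] - mu) <= n_sd * sd
    return keep

# toy usage; 52 segments ~ weekly chunks for a year at 2 samples/hour
rng = np.random.default_rng(1)
y = rng.normal(0.15, 0.01, 17520)
y[::500] -= 0.2                      # inject some junk below the signal
clean = y[sd_filter(y)]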

Related

How can I split EEG evoked potentials in different frequency bands using MNE-python?

So far, I have calculated the evoked potentials. However, I would like to see if there is relatively more activity in the theta band with respect to the other bands. When I use mne.Evoked.filter, I get a plot which looks a lot like a sine wave, containing no useful information. Furthermore, the edge regions (time goes from -0.2s to 1s) are highly distorted.
Filtering will always result in edge artifacts, especially for low frequencies like theta (which require a longer filter). To perform analyses on low-frequency signal you should epoch your data into longer segments (epochs) than the time period you are interested in.
Also, if you are interested in theta oscillations it is better to perform time-frequency analysis than to filter the ERP. The ERP contains only time-locked activity, while with a time-frequency representation you will be able to see theta even in time periods where it was not phase-aligned across trials. You may want to follow this tutorial for example.
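A rough sketch of that route, assuming you already have an MNE Epochs object built with generous padding (e.g. tmin=-1.0, tmax=2.0); the frequency grid, baseline window and plotting choices below are only illustrative:

import numpy as np
from mne.time_frequency import tfr_morlet

# epochs: your existing mne.Epochs object, epoched wider than the -0.2 to 1.0 s window of interest
freqs = np.arange(4.0, 31.0, 1.0)     # theta through beta
n_cycles = freqs / 2.0                # common rule of thumb

power = tfr_morlet(epochs, freqs=freqs, n_cycles=n_cycles, return_itc=False, decim=2)
power.apply_baseline(baseline=(-0.5, -0.2), mode="logratio")
power.plot(combine="mean")            # then crop the display to the window you care about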
Also make sure to see the many rich tutorials and examples in the MNE docs.
If you have any further problems we use Discourse now: https://mne.discourse.group/

Classifying pattern in time series

I am dealing with a repeating pattern in time series data. My goal is to classify every pattern as 1, and anything that does not follow the pattern as 0. The pattern repeats itself between every two peaks as shown below in the image.
The patterns are not necessarily fixed in sample size but stay within an approximate sample size, let's say 500 samples ±10%. The heights of the peaks can change. The random signal (I called it random, but basically it means not following the pattern shape) can also change in value.
The data is from a sensor. Patterns are when the device is working smoothly. If the device is malfunctioning, then I will not see the patterns and will get something similar to the class 0 I have shown in the image.
What I have done so far is build a logistic regression model. Here are my steps for data preparation (sketched in code after the list):
Grab data between every two consecutive peaks, resample it to a fixed size of 100 samples, scale data to [0-1]. This is class 1.
Repeated step 1 on the data between valleys and called it class 0.
I generated some noise, and repeated step 1 on chunks of 500 samples to build extra class 0 data.
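Roughly, the class 1 preparation looks like this (a hedged sketch; the peak-detection parameters and the toy signal are placeholders for however the peaks are actually found):

import numpy as np
from scipy.signal import find_peaks, resample

def segments_between_peaks(x, target_len=100, distance=450):
    """Cut x between consecutive peaks, resample each chunk to target_len samples,
    and min-max scale it to [0, 1]."""
    peaks, _ = find_peaks(x, distance=distance)
    segs = []
    for a, b in zip(peaks[:-1], peaks[1:]):
        chunk = resample(x[a:b], target_len)
        chunk = (chunk - chunk.min()) / (chunk.max() - chunk.min() + 1e-12)
        segs.append(chunk)
    return np.array(segs)

# toy usage: a repeating ~500-sample pattern
t = np.arange(5000)
x = np.sin(2 * np.pi * t / 500) + 0.05 * np.random.randn(t.size)
X_class1 = segments_between_peaks(x)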
The bottom figure shows my predictions on the test dataset. Prediction on the noise chunk is not great. I am worried that in the real data I may get even more false positives. Any idea on how I can improve my predictions? Any better approach when there is no class 0 data available?
I have seen a similar question here. My understanding of Hidden Markov Models is limited but I believe they're used to predict future data. My goal is to classify a sliding window of 500 samples throughout my data.
I have some proposals that you could try out.
First, I think recurrent neural networks (e.g. LSTMs) are often used in this field. But I have also heard that some people work with tree-based methods like LightGBM (I think Aileen Nielsen uses this approach).
So if you don't want to dive into neural networks, which is probably not necessary because your signals seem relatively easy to distinguish, you can give LightGBM (or other tree ensemble methods) a chance.
If you know the maximum length of a positive sample, you can define the length of your "sliding sample-window" that becomes your input vector (so each sample in the sliding window becomes one input feature); I would then add an extra attribute with the number of samples since the last peak occurred (outside/before the sample window). Then you can decide by how many steps to slide the window over the data. This also depends on the memory you have available.
But maybe it would be wise to skip some of the windows around a transition between positive and negative, because those states might not be classifiable unambiguously.
In case memory becomes an issue, neural networks could be the better choice, because for training they do not need all training data available at once, so you can generate your input data in batches. With tree-based methods this possibility does not exist, or only in a very limited way.
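A hedged sketch of that framing with LightGBM; the window length, step size, labelling rule and synthetic signal below are all placeholders:

import numpy as np
import lightgbm as lgb

def make_windows(x, labels, win=550, step=50):
    """Each raw-sample window becomes one feature vector; labels is a per-sample
    0/1 array marking pattern regions. A 'samples since last peak' column could be appended too."""
    X, y = [], []
    for start in range(0, x.size - win, step):
        X.append(x[start:start + win])
        y.append(int(labels[start:start + win].mean() > 0.5))
    return np.array(X), np.array(y)

# toy signal: a repeating ~500-sample pattern with a 'malfunction' stretch in the middle
t = np.arange(20000)
x = np.sin(2 * np.pi * t / 500) + 0.05 * np.random.randn(t.size)
labels = np.ones(t.size, dtype=int)
x[8000:12000] = 0.3 * np.random.randn(4000)
labels[8000:12000] = 0

X, y = make_windows(x, labels)
model = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05).fit(X, y)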
I'm not sure of what you are trying to achieve.
If you want to characterize what is a peak or not - which is an after-the-fact classification - then you can use a simple rule to define peaks such as signal(t) - average(signal, t-N to t) > T, with T a certain threshold and N the number of data points to look backwards over.
This would qualify what is a peak (class 1) and what is not (class 0), and hence performs a classification of the patterns.
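In code, that rule is only a few lines (NumPy sketch; N and T are placeholders to tune):

import numpy as np

def label_peaks(signal, N=50, T=0.5):
    """Class 1 wherever signal(t) exceeds the mean of the previous N samples by more than T."""
    labels = np.zeros(signal.size, dtype=int)
    for t in range(N, signal.size):
        if signal[t] - signal[t - N:t].mean() > T:
            labels[t] = 1
    return labels

# tiny check: a single spike at index 200 gets flagged
x = np.concatenate([np.zeros(200), [2.0], np.zeros(200)])
print(label_peaks(x).nonzero()[0])   # -> [200]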
If your goal is to predict that a peak is going to happen a few time units before the peak (at time t), using say data from t-n1 to t-n2 as features, then logistic regression might not necessarily be the best choice.
To find the right model you have to start by visualizing the features you have from t-n1 to t-n2 for every peak(t) and seeing if there is any pattern you can find. And it can be anything:
was there a peak in the n3 days before t?
is there a trend?
was there an outlier (perhaps after transforming your data, e.g. exponentiating it)?
In order to compare these patterns, consider normalizing them so that the n2-n1 data points go from 0 to 1, for example.
If you find a pattern visually then you will know what kind of model is likely to work, on which features.
If you don't, then it's likely that the white noise you added will do just as well, so you might not find a good prediction model.
However, your bottom graph is not so bad; you have only 2 major false positives out of >15 predictions. That suggests better feature engineering could close the gap.

Is it recommended to combine multiple "related" features into one vector in the scenario below?

I am a beginner/novice at "practical" machine learning.
I have compiled a very large data set to create a binary classification machine learning model. The data set has over 80 columns but I'm trying to shrink that column list down. I've run the data through multiple algorithms (Decision Tree, Random Forest, Gradient Boosting); used various hyper-parameter tuning; and analyzed multiple permutation feature importance (PFI) results to see what features need to be removed. So far, my accuracy (and other metrics such as F1-score, precision, recall) is hovering anywhere between 70 and 80%. My question is this:
If I have a subset of 2-4 columns whose data is not only related, but dependent on each other
i.e.
- colA won't make much sense without also looking at and using colB, colC, etc
- colA won't make much sense without adding/subtracting/dividing with colB
Is it possible/recommended to combine these few columns into a vector or another feature?
For example, colA plotted as a time series would make a nice non-linear curved line. colB plotted as a time series would also make a nice non-linear curved line. However, looking at each of these lines won't make much sense until you look at where they intersect (which happens again and again). So you can see here that the distance between any two points (colA, colB) is really important.
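For example, the kind of derived columns I mean, as a toy pandas sketch (the stand-in data and the particular derived features are only for illustration):

import numpy as np
import pandas as pd

# toy stand-in for the real table; colA and colB are the two intersecting curves
df = pd.DataFrame({"colA": np.sin(np.linspace(0, 10, 200)),
                   "colB": np.cos(np.linspace(0, 10, 200))})

df["diff_AB"] = df["colA"] - df["colB"]                                     # signed distance between the curves
df["ratio_AB"] = df["colA"] / (df["colB"].abs() + 1e-9)                     # relative size
df["crossed"] = (np.sign(df["diff_AB"]).diff().fillna(0) != 0).astype(int)  # 1 where the curves cross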
BUT when I include a colC which is the difference between colA and colB, the PFI analysis kicks colC back as a bad feature that lowers accuracy, etc.
Any help with this is greatly appreciated and thank you all in advance for your help.
If you need me to provide any more info/example, let me know. Thanks again.
I have no idea how your data looks. You might want to show an example/matrix so people can answer better :)
If you are sure you want to remove features, have you tried Lasso? With L1 regularization some features get completely ignored. Set alpha low and see what happens.
Ex: lasso001 = Lasso(alpha=0.01, max_iter=10000).fit(X_train, y_train) ... and so on.
Print the number of features used. If you want it to use more features, set alpha lower; just be aware of overfitting...
You can also "tune" the corresponding parameter (C) in LogisticRegression and linear SVMs.
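A hedged sketch of both suggestions (Lasso and an L1-penalised LogisticRegression), using a synthetic 80-column data set in place of yours:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Lasso, LogisticRegression

# synthetic stand-in: 80 columns, only 10 of them informative
X_train, y_train = make_classification(n_samples=500, n_features=80,
                                        n_informative=10, random_state=0)

for alpha in [0.1, 0.01, 0.001]:
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X_train, y_train)
    print(f"alpha={alpha}: Lasso keeps {np.sum(lasso.coef_ != 0)} of {X_train.shape[1]} features")

# the classification analogue: tune C on an L1-penalised logistic regression
logreg = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_train, y_train)
print(f"C=0.1: logistic regression keeps {np.sum(logreg.coef_ != 0)} features")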

Applying machine learning to training data parameters

I'm new to machine learning, and I understand that there are parameters and choices that apply to the model you attach to a certain set of inputs, which can be tuned/optimised. But those inputs obviously tie back to fields you generated by slicing and dicing whatever source data you had in a way that makes sense to you. What if the way you decided to model and cut up your source data, and therefore training data, isn't optimal? Are there ways or tools that extend the power of machine learning into not only the model, but also the way the training data was created in the first place?
Say you're analysing the accelerometer, GPS, heart-rate and surrounding topography data of someone moving. You want to try to determine where this person is likely to become exhausted and stop, assuming they'll continue moving in a straight line based on their trajectory, and that going up any hill will increase heart rate to some point where they must stop. Whether they're running or walking obviously modifies these things.
So you cut up your data, and feel free to correct how you'd do this, but it's less relevant to the main question:
Slice up raw accelerometer data along the X, Y, Z axes for the past A seconds into B slices to try and profile it, probably applying a CNN to it, to determine if running or walking
Cut up the recent C seconds of raw GPS data into a sequence of D (Lat, Long) pairs, each pair representing the average of E seconds of raw data
Based on the previous sequence, determine speed and trajectory, and determine the upcoming slope by slicing the next F distance (or G seconds; which to use is itself another option to determine) into H slices, profiling each, etc...
You get the idea. How do you effectively determine A through H, some of which would completely change the number and behaviour of model inputs? I want to take out any bias I may have about what's right, and let the process determine it end-to-end. Are there practical solutions to this? Each time it changes the parameters of data creation, go back, re-generate the training data, feed it into the model, train it, tune it, over and over again until you get the best result.
What you call your bias is actually the greatest strength you have: you can include your knowledge of the system. Machine learning, including glorious deep learning, is, to put it bluntly, stupid. Although it can figure out features for you, interpretation of these will be difficult.
Also, deep learning especially has a great capacity to memorise (not learn!) patterns, making it easy to overfit to the training data. Making machine learning models that generalise well in the real world is tough.
In most successful approaches (check against Master Kagglers) people create features. In your case I'd probably want to calculate the magnitude and vector of the force. Depending on the type of scenario, I might transform (Lat, Long) into distance from a specific point (say, the point of origin / activation, or one established every 1 minute) or maybe use a different coordinate system.
Since your data is a time series, I'd probably use something well suited for time-series modelling that you can understand and troubleshoot. CNNs and such are typically your last resort in the majority of cases.
If you really would like to automate it, check e.g. Auto Keras or ludwig. When it comes to learning which features matter most, I'd recommend going with gradient boosting (GBDT).
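A hedged sketch of that combination: hand-crafted features (force magnitude from the three accelerometer axes) plus gradient-boosting importances to see which ones matter. Every column name and the target here are made up purely for illustration:

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# toy stand-in: one row per time window with per-axis accelerometer readings and heart rate
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["acc_x", "acc_y", "acc_z"])
df["acc_magnitude"] = np.sqrt(df.acc_x**2 + df.acc_y**2 + df.acc_z**2)   # hand-crafted feature
df["heartrate"] = rng.normal(120, 20, size=1000)

# fake "did they stop?" target, just so the example runs end to end
y = (df["acc_magnitude"] + 0.02 * df["heartrate"] + rng.normal(0, 0.5, 1000) > 4).astype(int)

gbdt = GradientBoostingClassifier().fit(df, y)
for name, imp in sorted(zip(df.columns, gbdt.feature_importances_), key=lambda p: -p[1]):
    print(f"{name:15s} {imp:.3f}")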
I'd recommend reading this article from AirBnB that takes a deeper dive into the journey of building such systems and the feature engineering involved.

Simulink - Finding index of vector element where accumulation crosses a threshold

I'm looking to improve the delay-estimation portion of a Simulink model. The input is an estimated impulse response for the system. I want the index of the first sample of the impulse response where the sum of the absolute values of it and the previous elements exceeds a certain fraction of the total across the whole vector.
Here's my current solution:
The matrix sum runs along dimension 2. The prelookup block is set to clip. This finds the element (possibly off by one, I haven't thought that through yet) where 1% of the total is reached.
This seems overly complicated, and it isn't clear what it is trying to do without some explanation. I tried coming up with a solution based on the discrete integrator/accumulator block but couldn't come up with something better. It certainly does a lot more addition than it needs to with this solution, although performance isn't really an issue right now.
Is there a simpler way to get the running sum across a vector that I could put in place of the Toeplitz->Triangular->Sum section? Is there a better way overall to perform the whole lookup?
If you have the DSP System Toolbox, there is a "Cumulative Sum" block which should be able to replace your Toeplitz, triangular matrix, and matrix sum blocks.
http://www.mathworks.com/help/dsp/ref/cumulativesum.html
If you do not have the DSP System Toolbox, I suggest coding this in a MATLAB Function block, where it should be a one-liner:
y = cumsum(x);
While you are there, you may also want to code the entire logic in a MATLAB Function block, which in cases like this is easier to code and understand.
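For reference, the whole lookup boils down to a few lines of logic. Here it is sketched in Python/NumPy with a made-up impulse response (the MATLAB Function block version would use cumsum and find in the same way):

import numpy as np

def delay_index(h, fraction=0.01):
    """0-based index of the first sample where the running sum of |h| reaches
    the given fraction of the total."""
    running = np.cumsum(np.abs(h))
    return int(np.argmax(running >= fraction * running[-1]))

# toy impulse response: energy arrives around sample 40
h = np.concatenate([0.0001 * np.ones(40), [0.5, 1.0, 0.3], 0.0001 * np.ones(57)])
print(delay_index(h, fraction=0.01))   # -> 40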
