kalman filter with redundant state measurements - orientation

I am trying to implement a kalman filter for orientation detection. just like most other implementations I found online, I will be using a gyro and accelerometer to measure the pitch and roll, however I intend to also add horizon detection. This will give me a second reading for the pitch and roll. This means that I will have two means of measuring the current state, accelerometer and horizon detection whilst the gyro will be used for control.
So far I have implemented the filter on the sensor data and horizon detection separately based on this tutorial: http://blog.tkjelectronics.dk/2012/09/a-practical-approach-to-kalman-filter-and-how-to-implement-it/
Which part of the kalman filter do I have to modify for the algorithm to choose the best reading between the predicted state, accelerometer reading and horizon detected reading? Any help, links to papers or sites will be appreciated
thanks in advance for your help

The KF consists of two parallel components:
1. the estimated state, AND
2. the uncertainty in that estimate (specifically, the covariance matrix of the state components).
When combining 2 estimates of the state, the standard method takes a weighted average, with the weights being the inverses of the (co)variances. That is, the more certain of the 2 estimates (smaller covariance) is weighted more highly than the other.
So if you are not already tracking the covariances for your 2 estimates, you will need to do this.
For a scalar state X with 2 estimates X' and X", you would have a variance for each estimate: V' and V", with inverses C' = 1/V' and C" = 1/V". (The "certainties" C are easier to use than the variances V.)
Then the MMSE estimate (which is what the KF attempts to optimize) of the state is given by: Xmmse = (X'/V' + X"/V") / (1/V' + 1/V").
[There is also a corresponding update to V at this point, based on V' and V".]
For vectorial states, V will be replaces by a covariance matrix, and the divisions will become matrix inverses. In this case, it may be easier to track the inverses C directly: Xmmse = (C' + C") \ (C' * X' + C" * X") [and corresponding update to C],
where "\" denotes pre-multiplication by the inverse of the first factor, C'+C".
I hope this helps.
[I apologize for the poor formatting here. Stack Overflow has inferred that the algebraic expressions are code, and demanded formatting them as code before it would let me answer. They aren't, so I couldn't.]

Related

Is there a fundamental difference between DSP for Audio Signal Processing and Sensor Signal Processing?

Audio is made up of multiple frequencies occurring at any given time, and we can perform the FFT to get the Frequency bins, but what does the concept of Frequency mean when it comes to Sensor data?
For example, a Triaxial Accelerometer somehow converts a voltage signal and produces acceleration readings in ms^-2. Is the FFT performed with those X,Y,Z readings or the voltages sampled at Fs.
Am I overcomplicating things or is there a difference in mindset when performing DSP for Audio vs Sensor data?
A Fourier transform is tool to convert functions or signals into something that is easier to work with. It is a mathematical tool. The result can have an easy physical interpretation but that is not always the case.
Assume you have an object with constant mass and several periodic sin-like forces F_1*sin(c*t), F_2*sin(d*t), ... that act on the object. The total force is just the sum of those forces:
F(t) = F_1*sin(c*t) + F_2*sin(d*t) + ...
You get the particle's acceleration using Newton's 2nd law:
m * a(t) = F(t)
=> a(t) = F(t) / m = F_1/m * sin(c*t) + F_2/m * sin(d*t) + ...
Let's assume you measured a(t) but don't know the equation above. It you perform a Fourier transformation you can calculate the values of F_1/m, F_2/m, ... . That means your Fourier transform of the the acceleration is the amplitude of the force at the given frequency over the object's mass.
This interpretation works because the Fourier transform is linear and so is the adding of forces (See Newtons 2nd law). If you describe something non-linear chances are that there is no easy interpretation of the result of the transformation.
So when do you perform the FFT? It depends:
If you do it to improve you signal (remove noise) do it on the measured data.
If you want to analyse the physical object (resonances) do it on the acceleration.
(If the conversion is linear (ADC output to m/s^2 is a simple multiplication) it does not matter.)
I hope this makes things a bit clearer (at least from the physical point of view).

How to combine various distance functions into one given the following dataset?

I have a few distance functions which return distance between two images , I want to combine these distance into a single distance, using weighted scoring e.g. ax1+bx2+cx3+dx4 etc i want to learn these weights automatically such that my test error is minimised.
For this purpose i have a labeled dataset which has various triplets of images such that (a,b,c) , a has less distance to b than it has to c.
i.e. d(a,b)<d(a,c)
I want to learn such weights so that this ordering of triplets can be as accurate as possible.(i.e. the weighted linear score given is less for a&b and more for a&c).
What sort of machine learning algorithm can be used for the task,and how the desired task can be achieved?
Hopefully I understand your question correctly, but it seems that this could be solved more easily with constrained optimization directly, rather than classical machine learning (the algorithms of which are often implemented via constrained optimization, see e.g. SVMs).
As an example, a possible objective function could be:
argmin_{w} || e ||_2 + lambda || w ||_2
where w is your weight vector (Oh god why is there no latex here), e is the vector of errors (one component per training triplet), lambda is some tunable regularizer constant (could be zero), and your constraints could be:
max{d(I_p,I_r)-d(I_p,I_q),0} <= e_j for jth (p,q,r) in T s.t. d(I_p,I_r) <= d(I_p,I_q)
for the jth constraint, where I_i is image i, T is the training set, and
d(u,v) = sum_{w_i in w} w_i * d_i(u,v)
with d_i being your ith distance function.
Notice that e is measuring how far your chosen weights are from satisfying all the chosen triplets in the training set. If the weights preserve ordering of label j, then d(I_p,I_r)-d(I_p,I_q) < 0 and so e_j = 0. If they don't, then e_j will measure the amount of violation of training label j. Solving the optimization problem would give the best w; i.e. the one with the lowest error.
If you're not familiar with linear/quadratic programming, convex optimization, etc... then start googling :) Many libraries exist for this type of thing.
On the other hand, if you would prefer a machine learning approach, you may be able to adapt some metric learning approaches to your problem.

Discrete Wavelet Transform (Daubechies wavelet) of an array complex numbers

Say, I have a signal represented as an array of real numbers y = [1,2,0,4,5,6,7,90,5,6]. I can use Daubechies-4 coefficients D4 = [0.482962, 0.836516, 0.224143, -0.129409], and apply a wavelet transform to receive high- and low-frequencies of the signal. So, the high frequency component will be calculated like this:
high[v] = y[2*v]*D4[0] + y[2*v+1]*D4[1] + y[2*v+2]*D4[2] + y[2*v+3]*D4[3],
and the low frequency component can be calculated using other D4 coefs permutation.
The question is: what if y is complex array? Do I just multiply and add complex numbers to receive subbands, or is it correct to get amplitude and phase, treat each of them like a real number, do the wavelet transform for them, and then restore complex number array for each subband using formulas real_part = abs * cos(phase) and imaginary_part = abs * sin(phase)?
To handle the case of complex data, you're looking at the Complex Wavelet Transform. It's actually a simple extension to the DWT. The most common way to handle complex data is to treat the real and imaginary components as two separate signals and perform a DWT on each component separately. You will then receive the decomposition of the real and imaginary components.
This is commonly known as the Dual-Tree Complex Wavelet Transform. This can best be described by the figure below that I pulled from Wikipedia:
Source: Wikipedia
It's called "dual-tree" because you have two DWT decompositions happening in parallel - one for the real component and one for the imaginary. In the above diagram, g0/h0 represent the low-pass and high-pass components of the real part of the signal x and g1/h1 represent the low-pass and high-pass components of the imaginary part of the signal x.
Once you decompose the real and imaginary parts into their respective DWT decompositions, you can combine them to get the magnitude and/or phase and proceed to the next step or whatever you desire to do with them.
The mathematical proof regarding the correctness of this is outside the scope of what we're talking about, but if you would like to see how this got derived, I refer you to the canonical paper by Kingsbury in 1997 in the work Image Processing with Complex Wavelets - http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=835E60EAF8B1BE4DB34C77FEE9BBBD56?doi=10.1.1.55.3189&rep=rep1&type=pdf. Pay close attention to the noise filtering of images using the CWT - this is probably what you're looking for.

How to use Odds ratio feature selection with Naive bayes Classifier

I want to classify documents (composed of words) into 3 classes (Positive, Negative, Unknown/Neutral). A subset of the document words become the features.
Until now, I have programmed a Naive Bayes Classifier using as a feature selector Information gain and chi-square statistics. Now, I would like to see what happens if I use Odds ratio as a feature selector.
My problem is that I don't know hot to implement Odds-ratio. Should I:
1) Calculate Odds Ratio for every word w, every class:
E.g. for w:
Prob of word as positive Pw,p = #positive docs with w/#docs
Prob of word as negative Pw,n = #negative docs with w/#docs
Prob of word as unknown Pw,u = #unknown docs with w/#docs
OR(Wi,P) = log( Pw,p*(1-Pw,p) / (Pw,n + Pw,u)*(1-(Pw,n + Pw,u)) )
OR(Wi,N) ...
OR(Wi,U) ...
2) How should I decide if I choose or not the word as a feature ?
Thanks in advance...
Since it took me a while to independently wrap my head around all this, let me explain my findings here for the benefit of humanity.
Using the (log) odds ratio is a standard technique for filtering features prior to text classification. It is a 'one-sided metric' [Zheng et al., 2004] in the sense that it only discovers features which are positively correlated with a particular class. As a log-odds-ratio for the probability of seeing a feature 't' given the class 'c', it is defined as:
LOR(t,c) = log [Pr(t|c) / (1 - Pr(t|c))] : [Pr(t|!c) / (1 - Pr(t|!c))]
= log [Pr(t|c) (1 - Pr(t|!c))] / [Pr(t|!c) (1 - Pr(t|c))]
Here I use '!c' to mean a document where the class is not c.
But how do you actually calculate Pr(t|c) and Pr(t|!c)?
One subtlety to note is that feature selection probabilities, in general, are usually defined over a document event model [McCallum & Nigam 1998, Manning et al. 2008], i.e., Pr(t|c) is the probability of seeing term t one or more times in the document given the class of the document is c (in other words, the presence of t given the class c). The maximum likelihood estimate (MLE) of this probability would be the proportion of documents of class c that contain t at least once. [Technically, this is known as a Multivariate Bernoulli event model, and is distinct from a Multinomial event model over words, which would calculate Pr(t|c) using integer word counts - see the McCallum paper or the Manning IR textbook for more details, specifically on how this applies to a Naive Bayes text classifier.]
One key to using LOR effectively is to smooth these conditional probability estimates, since, as #yura noted, rare events are problematic here (e.g., the MLE of Pr(t|!c) could be zero, leading to an infinite LOR). But how do we smooth?
In the literature, Forman reports smoothing the LOR by "adding one to any zero count in the denominator" (Forman, 2003), while Zheng et al (2004) use "ELE [Expected Likelihood Estimation] smoothing" which usually amounts to adding 0.5 to each count.
To smooth in a way that is consistent with probability theory, I follow standard practices in text classification with a Multivariate Bernoulli event model. Essentially, we assume that we have seen each presence count AND each absence count B extra times. So our estimate for Pr(t|c) can be written in terms of #(t,c): the number of times we've seen t and c, and #(t,!c): the number of times we've seen t without c, as follows:
Pr(t|c) = [#(t,c) + B] / [#(t,c) + #(t,!c) + 2B]
= [#(t,c) + B] / [#(c) + 2B]
If B = 0, we have the MLE. If B = 0.5, we have ELE. If B = 1, we have the Laplacian prior. Note this looks different than smoothing for the Multinomial event model, where the Laplacian prior leads you to add |V| in the denominator [McCallum & Nigam, 1998]
You can choose 0.5 or 1 as your smoothing value, depending on which prior work most inspires you, and plug this into the equation for LOR(t,c) above, and score all the features.
Typically, you then decide on how many features you want to use, say N, and then choose the N highest-ranked features based on the score.
In a multi-class setting, people have often used 1 vs All classifiers and thus did feature selection independently for each classifier and thus each positive class with the 1-sided metrics (Forman, 2003). However, if you want to find a unique reduced set of features that works in a multiclass setting, there are some advanced approaches in the literature (e.g. Chapelle & Keerthi, 2008).
References:
Zheng, Wu, Srihari, 2004
McCallum & Nigam 1998
Manning, Raghavan & Schütze, 2008
Forman, 2003
Chapelle & Keerthi, 2008
Odd ratio is not good measure for feature selection, because it is only shows what happen when feature present, and nothing when it is not. So it will not work for rare features and almost all features are rare so it not work for almost all features. Example feature with 100% confidence that class is positive which present in 0.0001 is useless for classification. Therefore if you still want to use odd ratio add threshold on frequency of feature, like feature present in 5% of cases. But I would recommend better approach - use Chi or info gain metrics which automatically solve those problems.

Recommended anomaly detection technique for simple, one-dimensional scenario?

I have a scenario where I have several thousand instances of data. The data itself is represented as a single integer value. I want to be able to detect when an instance is an extreme outlier.
For example, with the following example data:
a = 10
b = 14
c = 25
d = 467
e = 12
d is clearly an anomaly, and I would want to perform a specific action based on this.
I was tempted to just try an use my knowledge of the particular domain to detect anomalies. For instance, figure out a distance from the mean value that is useful, and check for that, based on heuristics. However, I think it's probably better if I investigate more general, robust anomaly detection techniques, which have some theory behind them.
Since my working knowledge of mathematics is limited, I'm hoping to find a technique which is simple, such as using standard deviation. Hopefully the single-dimensioned nature of the data will make this quite a common problem, but if more information for the scenario is required please leave a comment and I will give more info.
Edit: thought I'd add more information about the data and what I've tried in case it makes one answer more correct than another.
The values are all positive and non-zero. I expect that the values will form a normal distribution. This expectation is based on an intuition of the domain rather than through analysis, if this is not a bad thing to assume, please let me know. In terms of clustering, unless there's also standard algorithms to choose a k-value, I would find it hard to provide this value to a k-Means algorithm.
The action I want to take for an outlier/anomaly is to present it to the user, and recommend that the data point is basically removed from the data set (I won't get in to how they would do that, but it makes sense for my domain), thus it will not be used as input to another function.
So far I have tried three-sigma, and the IQR outlier test on my limited data set. IQR flags values which are not extreme enough, three-sigma points out instances which better fit with my intuition of the domain.
Information on algorithms, techniques or links to resources to learn about this specific scenario are valid and welcome answers.
What is a recommended anomaly detection technique for simple, one-dimensional data?
Check out the three-sigma rule:
mu = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std THEN x is outlier
An alternative method is the IQR outlier test:
Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25 // inter-quartile range
IF (x < Q25 - 1.5*IQR) OR (Q75 + 1.5*IQR < x) THEN x is a mild outlier
IF (x < Q25 - 3.0*IQR) OR (Q75 + 3.0*IQR < x) THEN x is an extreme outlier
this test is usually employed by Box plots (indicated by the whiskers):
EDIT:
For your case (simple 1D univariate data), I think my first answer is well suited.
That however isn't applicable to multivariate data.
#smaclell suggested using K-means to find the outliers. Beside the fact that it is mainly a clustering algorithm (not really an outlier detection technique), the problem with k-means is that it requires knowing in advance a good value for the number of clusters K.
A better suited technique is the DBSCAN: a density-based clustering algorithm. Basically it grows regions with sufficiently high density into clusters which will be maximal set of density-connected points.
DBSCAN requires two parameters: epsilon and minPoints. It starts with an arbitrary point that has not been visited. It then finds all the neighbor points within distance epsilon of the starting point.
If the number of neighbors is greater than or equal to minPoints, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively.
If the number of neighbors is less than minPoints, the point is marked as noise.
If a cluster is fully expanded (all points within reach are visited) then the algorithm proceeds to iterate through the remaining unvisited points until they are depleted.
Finally the set of all points marked as noise are considered outliers.
There are a variety of clustering techniques you could use to try to identify central tendencies within your data. One such algorithm we used heavily in my pattern recognition course was K-Means. This would allow you to identify whether there are more than one related sets of data, such as a bimodal distribution. This does require you having some knowledge of how many clusters to expect but is fairly efficient and easy to implement.
After you have the means you could then try to find out if any point is far from any of the means. You can define 'far' however you want but I would recommend the suggestions by #Amro as a good starting point.
For a more in-depth discussion of clustering algorithms refer to the wikipedia entry on clustering.
This is an old topic but still it lacks some information.
Evidently, this can be seen as a case of univariate outlier detection. The approaches presented above have several pros and cons. Here are some weak spots:
Detection of outliers with the mean and sigma has the obvious disadvantage of dependence of mean and sigma on the outliers themselves.
The case of the small sample limit (see question for example) is not adequately covered by, 3 sigma, K-Means, IQR etc.
And I could go on... However the statistical literature offers a simple metric: the median absolute deviation. (Medians are insensitive to outliers)
Details can be found here: https://www.sciencedirect.com/book/9780128047330/introduction-to-robust-estimation-and-hypothesis-testing
I think this problem can be solved in a few lines of python code like this:
import numpy as np
import scipy.stats as sts
x = np.array([10, 14, 25, 467, 12]) # your values
np.abs(x - np.median(x))/(sts.median_abs_deviation(x)/0.6745) #MAD criterion
Subsequently you reject values above a certain threshold (97.5 percentile of the distribution of data), in case of an assumed normal distribution the threshold is 2.24. Here it translates to:
array([ 0.6745 , 0. , 1.854875, 76.387125, 0.33725 ])
or the 467 entry being rejected.
Of course, one could argue, that the MAD (as presented) also assumes a normal dist. Therefore, why is it that argument 2 above (small sample) does not apply here? The answer is that MAD has a very high breakdown point. It is easy to choose different threshold points from different distributions and come to the same conclusion: 467 is the outlier.
Both three-sigma rule and IQR test are often used, and there are a couple of simple algorithms to detect anomalies.
The three-sigma rule is correct
mu = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std THEN x is outlier
The IQR test should be:
Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25 // inter-quartile range
If x > Q75 + 1.5 * IQR or x < Q25 - 1.5 * IQR THEN x is a mild outlier
If x > Q75 + 3.0 * IQR or x < Q25 – 3.0 * IQR THEN x is a extreme outlier

Resources