Select an integer number of periods - signal-processing

Suppose we have sinusoidal with frequency 100Hz and sampling frequency of 1000Hz. It means that our signal has 100 periods in a second and we are taking 1000 samples in a second. Therefore, in order to select a complete period I'll have to take fs/f=10 samples. Right?
What if the sampling period is not a multiple of the frequency of the signal (like 550Hz)? Do I have to find the minimum multiple M of f and fs, and than take M samples?
My goal is to select an integer number of periods in order to be able to replicate them without changes.

You have f periods a second, and fs samples a second.
If you take M samples, it would cover M/fs part of a second, or P = f * (M/fs) periods. You want this number to be integer.
So you need to take M = fs / gcd(f, fs) samples.
For your example P = 1000 / gcd(100, 1000) = 1000 / 100 = 10.
If you have 60 Hz frequency and 80 Hz sampling frequency, it gives P = 80 / gcd(60, 80) = 80 / 20 = 4 -- 4 samples will cover 4 * 1/80 = 1/20 part of a second, and that will be 3 periods.
If you have 113 Hz frequency and 512 Hz sampling frequency, you are out of luck, since gcd(113, 512) = 1 and you'll need 512 samples, covering the whole second and 113 periods.

In general, an arbitrary frequency will not have an integer number of periods. Irrational frequencies will never even repeat ever. So some means other than concatenation of buffers one period in length will be needed to synthesize exactly periodic waveforms of arbitrary frequencies. Approximation by interpolation for fractional phase offsets is one possibility.

Related

Misconceptions about the Shannon-Nyquist theorem

I am a student working with time-series data which we feed into a neural network for classification (my task is to build and train this NN).
We're told to use a band-pass filter of 10 Hz to 150 Hz since anything outside that is not interesting.
After applying the band-pass, I've also down-sampled the data to 300 samples per second (originally it was 768 Hz). My understanding of the Shannon Nyquist sampling theorem is that, after applying the band-pass, any information in the data will be perfectly preserved at this sample-rate.
However, I got into a discussion with my supervisor who claimed that 300 Hz might not be sufficient even if the signal was band-limited. She says that it is only the minimum sample rate, not necessarily the best sample rate.
My understanding of the sampling theorem makes me think the supervisor is obviously wrong, but I don't want to argue with my supervisor, especially in case I'm actually the one who has misunderstood.
Can anyone help to confirm my understanding or provide some clarification? And how should I take this up with my supervisor (if at all).
The Nyquist-Shannon theorem states that the sampling frequency should at-least be twice of bandwidth, i.e.,
fs > 2B
So, this is the minimal criteria. If the sampling frequency is less than 2B then there will be aliasing. There is no upper limit on sampling frequency, but more the sampling frequency, the better will be the reconstruction.
So, I think your supervisor is right in saying that it is the minimal condition and not the best one.
Actually, you and your supervisor are both wrong. The minimum sampling rate required to faithfully represent a real-valued time series whose spectrum lies between 10 Hz and 150 Hz is 140 Hz, not 300 Hz. I'll explain this, and then I'll explain some of the context that shows why you might want to "oversample", as it is referred to (spoiler alert: Bailian-Low Theorem). The supervisor is mixing folklore into the discussion, and when folklore is not properly-contexted, it tends to telephone tag into fakelore. (That's a common failing even in the peer-reviewed literature, by the way). And there's a lot of fakelore, here, that needs to be defogged.
For the following, I will use the following conventions.
There's no math layout on Stack Overflow (except what we already have with UTF-8), so ...
a^b denotes a raised to the power b.
∫_I (⋯x⋯) dx denotes an integral of (⋯x⋯) taken over all x ∈ I, with the default I = ℝ.
The support supp φ (or supp_x φ(x) to make the "x" explicit) of a function φ(x) is the smallest closed set containing all the x-es for which φ(x) ≠ 0. For regularly-behaving (e.g. continuously differentiable) functions that means a union of closed intervals and/or half-rays or the whole real line, itself. This figures centrally in the Shannon-Nyquist sampling theorem, as its main condition is that a spectrum have bounded support; i.e. a "finite bandwidth".
For the Fourier transform I will use the version that has the 2π up in the exponent, and for added convenience, I will use the convention 1^x = e^{2πix} = cos(2πx) + i sin(2πx) (which I refer to as the Ramanujan Convention, as it is the convention I frequently used in my previous life oops I mean which Ramanujan secretly used in his life to make the math a whole lot simpler).
The set ℤ = {⋯, -2, -1, 0, +1, +2, ⋯ } is the integers, and 1^{x+z} = 1^x for all z∈ℤ - making 1^x the archetype of a periodic function whose period is 1.
Thus, the Fourier transform f̂(ν) of a function f(t) and its inverse are given by:
f̂(ν) = ∫ f(t) 1^{-νt} dt, f(t) = ∫ f̂(ν) 1^{+νt} dν.
The spectrum of the time series given by the function f(t) is the function f̂(ν) of the cyclic frequency ν, which is what is measured in Hertz (Hz.); t, itself, being measured in seconds. A common convention is to use the angular frequency ω = 2πν, instead, but that muddies the picture.
The most important example, with respect to the issue at hand, is the Fourier transform χ̂_Ω of the interval function given by χ_Ω(t) = 1 if t ∈ [-½Ω,+½Ω] and χ_Ω(t) = 0 else:
χ̂_Ω(t) = ∫_[-½Ω,+½Ω] 1^ν dν
= {1^{+½Ω} - 1^{-½Ω}}/{2πi}
= {2i sin πΩ}/{2πi}
= Ω sinc πΩ
which is where the function sinc x = (sin πx)/(πx) comes into play.
The cardinal form of the sampling theorem is that a function f(t) can be sampled over an equally-spaced sampled domain T ≡ { kΔt: k ∈ ℤ }, if its spectrum is bounded by supp f̂ ⊆ [-½Ω,+½Ω] ⊆ [-1/(2Δt),+1/(2Δt)], with the sampling given as
f(t) = ∑_{t'∈T} f(t') Ω sinc(Ω(t - t')) Δt.
So, this generally applies to [over-]sampling with redundancy factors 1/(ΩΔt) ≥ 1. In the special case where the sampling is tight with ΩΔt = 1, then it reduces to the form
f(t) = ∑_{t'∈T} f(t') sinc({t - t'}/Δt).
In our case, supp f̂ = [10 Hz., 150 Hz.] so the tightest fits are with 1/Δt = Ω = 300 Hz.
This generalizes to equally-spaced sampled domains of the form T ≡ { t₀ + kΔt: k ∈ ℤ } without any modification.
But it also generalizes to frequency intervals supp f̂ = [ν₋,ν₊] of width Ω = ν₊ - ν₋ and center ν₀ = ½ (ν₋ + ν₊) to the following form:
f(t) = ∑_{t'∈T} f(t') 1^{ν₀(t - t')} Ω sinc(Ω(t - t')) Δt.
In your case, you have ν₋ = 10 Hz., ν₊ = 150 Hz., Ω = 140 Hz., ν₀ = 80 Hz. with the condition Δt ≤ 1/140 second, a sampling rate of at least 140 Hz. with
f(t) = (140 Δt) ∑_{t'∈T} f(t') 1^{80(t - t')} sinc(140(t - t')).
where t and Δt are in seconds.
There is a larger context to all of this. One of the main places where this can be used is for transforms devised from an overlapping set of windowed filters in the frequency domain - a typical case in point being transforms for the time-scale plane, like the S-transform or the continuous wavelet transform.
Since you want the filters to be smoothly-windowed functions, without sharp corners, then in order for them to provide a complete set that adds up to a finite non-zero value over all of the frequency spectrum (so that they can all be normalized, in tandem, by dividing out by this sum), then their respective supports have to overlap.
(Edit: Generalized this example to cover both equally-spaced and logarithmic-spaced intervals.)
One example of such a set would be filters that have end-point frequencies taken from the set
Π = { p₀ (α + 1)ⁿ + β {(α + 1)ⁿ - 1} / α: n ∈ {0,1,2,⋯} }
So, for interval n (counting from n = 0), you would have ν₋ = p_n and ν₊ = p_{n+1}, where the members of Π are enumerated
p_n = p₀ (α + 1)ⁿ + β {(α + 1)ⁿ - 1} / α,
Δp_n = p_{n+1} - p_n = α p_n + β = (α p₀ + β)(α + 1)ⁿ,
n ∈ {0,1,2,⋯}
The center frequency of interval n would then be ν₀ = p_n + ½ Δp₀ (α + 1)ⁿ and the width would be Ω = Δp₀ (α + 1)ⁿ, but the actual support for the filter would overlap into a good part of the neighboring intervals, so that when you add up the filters that cover a given frequency ν the sum doesn't drop down to 0 as ν approaches any of the boundary points. (In the limiting case α → 0, this produces an equally-spaced frequency domain, suitable for an equalizer, while in the case β → 0, it produces a logarithmic scale with base α + 1, where octaves are equally-spaced.)
The other main place where you may apply this is to time-frequency analysis and spectrograms. Here, the role of a function f and its Fourier transform f̂ are reversed and the role of the frequency bandwidth Ω is now played by the (reciprocal) time bandwidth 1/Ω. You want to break up a time series, given by a function f(t) into overlapping segments f̃(q,λ) = g(λ)* f(q + λ), with smooth windowing given by the functions g(λ) with bounded support supp g ⊆ [-½ 1/Ω, +½ 1/Ω], and with interval spacing Δq much larger than the time sampling Δt (the ratio Δq/Δt is called the "hop" factor). The analogous role of Δt is played, here, by the frequency interval in the spectrogram Δp = Ω, which is now constant.
Edit: (Fixed the numbers for the Audacity example)
The minimum sampling rate for both supp_λ g and supp_λ f(q,λ) is Δq = 1/Ω = 1/Δp, and the corresponding redundancy factor is 1/(ΔpΔq). Audacity, for instance, uses a redundancy factor of 2 for its spectrograms. A typical value for Δp might be 44100/2048 Hz., while the time-sampling rate is Δt = 1/(2×3×5×7)² second (corresponding to 1/Δt = 44100 Hz.). With a redundancy factor of 2, Δq would be 1024/44100 second and the hop factor would be Δq/Δt = 1024.
If you try to fit the sampling windows, in either case, to the actual support of the band-limited (or time-limited) function, then the windows won't overlap and the only way to keep their sum from dropping to 0 on the boundary points would be for the windowing functions to have sharp corners on the boundaries, which would wreak havoc on their corresponding Fourier transforms.
The Balian-Low Theorem makes the actual statement on the matter.
https://encyclopediaofmath.org/wiki/Balian-Low_theorem
And a shout-out to someone I've been talking with, recently, about DSP-related matters and his monograph, which provides an excellent introductory reference to a lot of the issues discussed here.
A Friendly Guide To Wavelets
Gerald Kaiser
Birkhauser 1994
He said it's part of a trilogy, another installment of which is forthcoming.

vowpalwabbit strange features count

I have found that during training my model vw shows very big (much more than my features count ) feature number count in it's log.
I have tried to reproduce it using some small example:
simple.test:
-1 | 1 2 3
1 | 3 4 5
then "vw simple.test" command says that it have used 8 features. +one feature is constant but what are the other ? And in my real exmaple difference between my features and features used in wv is abot x10 more.
....
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = t
num sources = 1
average since example example current current current
loss last counter weight label predict features
finished run
number of examples = 2
weighted example sum = 2
weighted label sum = 3
average loss = 1.9179
best constant = 1.5
total feature number = 8 !!!!
total feature number displays a sum of feature counts from all observed examples. So it's 2*(3+1 constant)=8 in your case. The number of features in current example is displayed in current features column. Note that only 2^Nth example is printed on screen by default. In general observations can have unequal number of features.

FFT normalization

I know this question has been asked ad nauseam but somehow I can't make it work properly. I created a single, sine wave of 440 Hz having a unit amplitude. Now, after the FFT, the bin at 440 Hz has a distinct peak but the value just isn't right. I'd expect to see 0 dB since I'm dealing with a unit amplitude sine wave. Instead, the power calculated is well above 0 dB. The formula I'm using is simply
for (int i = 0; i < N/2; i++)
{
mag = sqrt((Real[i]*Real[i] + Img[i]*Img[i])/(N*0.54)); //0.54 correction for a Hamming Window
Mag[i] = 10 * log(mag) ;
}
I should probably point out that the total energy in the time domain is equal to the energy in the frequency domain (Parseval's theorem), so I know that my FFT routine is fine.
Any help is much appreciated.
I've been struggling with this again for work. It seems that a lot of software routines / books are a bit sloppy on the normalization of the FFT.
The best summary I have is: Energy needs to be conserved - which is Parseval's theorem. Also when coding this in Python, you can easily loose an element and not know it. Note that numpy.arrays indexing is not inclusive of the last element.
a = [1,2,3,4,5,6]
a[1:-1] = [2,3,4,5]
a[-1] = 6
Here's my code to normalize the FFT properly:
# FFT normalization to conserve power
import numpy as np
import matplotlib.pyplot as plt
import scipy.signal
sample_rate = 500.0e6
time_step = 1/sample_rate
carrier_freq = 100.0e6
# number of digital samples to simulate
num_samples = 2**18 # 262144
t_stop = num_samples*time_step
t = np.arange(0, t_stop, time_step)
# let the signal be a voltage waveform,
# so there is no zero padding
carrier_I = np.sin(2*np.pi*carrier_freq*t)
#######################################################
# FFT using Welch method
# windows = np.ones(nfft) - no windowing
# if windows = 'hamming', etc.. this function will
# normalize to an equivalent noise bandwidth (ENBW)
#######################################################
nfft = num_samples # fft size same as signal size
f,Pxx_den = scipy.signal.welch(carrier_I, fs = 1/time_step,\
window = np.ones(nfft),\
nperseg = nfft,\
scaling='density')
#######################################################
# FFT comparison
#######################################################
integration_time = nfft*time_step
power_time_domain = sum((np.abs(carrier_I)**2)*time_step)/integration_time
print 'power time domain = %f' % power_time_domain
# Take FFT. Note that the factor of 1/nfft is sometimes omitted in some
# references and software packages.
# By proving Parseval's theorem (conservation of energy) we can find out the
# proper normalization.
signal = carrier_I
xdft = scipy.fftpack.fft(signal, nfft)/nfft
# fft coefficients need to be scaled by fft size
# equivalent to scaling over frequency bins
# total power in frequency domain should equal total power in time domain
power_freq_domain = sum(np.abs(xdft)**2)
print 'power frequency domain = %f' % power_freq_domain
# Energy is conserved
# FFT symmetry
plt.figure(0)
plt.subplot(2,1,1)
plt.plot(np.abs(xdft)) # symmetric in amplitude
plt.title('magnitude')
plt.subplot(2,1,2)
plt.plot(np.angle(xdft)) # pi phase shift out of phase
plt.title('phase')
plt.show()
xdft_short = xdft[0:nfft/2+1] # take only positive frequency terms, other half identical
# xdft[0] is the dc term
# xdft[nfft/2] is the Nyquist term, note that Python 2.X indexing does NOT
# include the last element, therefore we need to use 0:nfft/2+1 to have an array
# that is from 0 to nfft/2
# xdft[nfft/2-x] = conjugate(xdft[nfft/2+x])
Pxx = (np.abs(xdft_short))**2 # power ~ voltage squared, power in each bin.
Pxx_density = Pxx / (sample_rate/nfft) # power is energy over -fs/2 to fs/2, with nfft bins
Pxx_density[1:-1] = 2*Pxx_density[1:-1] # conserve power since we threw away 1/2 the spectrum
# note that DC (0 frequency) and Nyquist term only appear once, we don't double those.
# Note that Python 2.X array indexing is not inclusive of the last element.
Pxx_density_dB = 10*np.log10(Pxx_density)
freq = np.linspace(0,sample_rate/2,nfft/2+1)
# frequency range of the fft spans from DC (0 Hz) to
# Nyquist (Fs/2).
# the resolution of the FFT is 1/t_stop
# dft of size nfft will give nfft points at frequencies
# (1/stop) to (nfft/2)*(1/t_stop)
plt.figure(1)
plt.plot(freq,Pxx_density_dB,'^')
plt.figure(1)
plt.plot(f, 10.0*np.log10(Pxx_den))
plt.xlabel('Freq (Hz)'),plt.ylabel('dBm/Hz'),#plt.show()
plt.ylim([-200, 0])
plt.show()
Many common (but not all) FFT libraries scale the FFT result of a unit amplitude sinusoid by the length of the FFT. This maintains Parsevals equality since a longer sinusoid represents more total energy than a shorter one of the same amplitude.
If you don't want to scale by the FFT length when using one of these libraries, then divide by the length before computing the magnitude in dB.
Normalization can be done in many different ways - depending on window, number of samples, etc.
Common trick: take FFT of known signal and normalize by the value of the peak. Say in the above example your peak is 123 - if you want it to be 1, then divide it ( and all results obtained with this algorithm) by 123.

Finding standard deviation using only mean, min, max?

I want to find the standard deviation:
Minimum = 5
Mean = 24
Maximum = 84
Overall score = 90
I just want to find out my grade by using the standard deviation
Thanks,
A standard deviation cannot in general be computed from just the min, max, and mean. This can be demonstrated with two sets of scores that have the same min, and max, and mean but different standard deviations:
1 2 4 5 : min=1 max=5 mean=3 stdev≈1.5811
1 3 3 5 : min=1 max=5 mean=3 stdev≈0.7071
Also, what does an 'overall score' of 90 mean if the maximum is 84?
I actually did a quick-and-dirty calculation of the type M Rad mentions. It involves assuming that the distribution is Gaussian or "normal." This does not apply to your situation but might help others asking the same question. (You can tell your distribution is not normal because the distance from mean to max and mean to min is not close). Even if it were normal, you would need something you don't mention: the number of samples (number of tests taken in your case).
Those readers who DO have a normal population can use the table below to give a rough estimate by dividing the difference of your measured minimum and your calculated mean by the expected value for your sample size. On average, it will be off by the given number of standard deviations. (I have no idea whether it is biased - change the code below and calculate the error without the abs to get a guess.)
Num Samples Expected distance Expected error
10 1.55 0.25
20 1.88 0.20
30 2.05 0.18
40 2.16 0.17
50 2.26 0.15
60 2.33 0.15
70 2.38 0.14
80 2.43 0.14
90 2.47 0.13
100 2.52 0.13
This experiment shows that the "rule of thumb" of dividing the range by 4 to get the standard deviation is in general incorrect -- even for normal populations. In my experiment it only holds for sample sizes between 20 and 40 (and then loosely). This rule may have been what the OP was thinking about.
You can modify the following python code to generate the table for different values (change max_sample_size) or more accuracy (change num_simulations) or get rid of the limitation to multiples of 10 (change the parameters to xrange in the for loop for idx)
#!/usr/bin/python
import random
# Return the distance of the minimum of samples from its mean
#
# Samples must have at least one entry
def min_dist_from_estd_mean(samples):
total = 0
sample_min = samples[0]
for sample in samples:
total += sample
sample_min = min(sample, sample_min)
estd_mean = total / len(samples)
return estd_mean - sample_min # Pos bec min cannot be greater than mean
num_simulations = 4095
max_sample_size = 100
# Calculate expected distances
sum_of_dists=[0]*(max_sample_size+1) # +1 so can index by sample size
for iternum in xrange(num_simulations):
samples=[random.normalvariate(0,1)]
while len(samples) <= max_sample_size:
sum_of_dists[len(samples)] += min_dist_from_estd_mean(samples)
samples.append(random.normalvariate(0,1))
expected_dist = [total/num_simulations for total in sum_of_dists]
# Calculate average error using that distance
sum_of_errors=[0]*len(sum_of_dists)
for iternum in xrange(num_simulations):
samples=[random.normalvariate(0,1)]
while len(samples) <= max_sample_size:
ave_dist = expected_dist[len(samples)]
if ave_dist > 0:
sum_of_errors[len(samples)] += \
abs(1 - (min_dist_from_estd_mean(samples)/ave_dist))
samples.append(random.normalvariate(0,1))
expected_error = [total/num_simulations for total in sum_of_errors]
cols=" {0:>15}{1:>20}{2:>20}"
print(cols.format("Num Samples","Expected distance","Expected error"))
cols=" {0:>15}{1:>20.2f}{2:>20.2f}"
for idx in xrange(10,len(expected_dist),10):
print(cols.format(idx, expected_dist[idx], expected_error[idx]))
Yo can obtain an estimate of the geometric mean, sometimes called the geometric mean of the extremes or GME, using the Min and the Max by calculating the GME= $\sqrt{ Min*Max }$. The SD can be then calculated using your arithmetic mean (AM) and the GME as:
SD= $$\frac{AM}{GME} * \sqrt{(AM)^2-(GME)^2 }$$
This approach works well for log-normal distributions or as long as the GME, GM or Median is smaller than the AM.
In principle you can make an estimate of standard deviation from the mean/min/max and the number of elements in the sample. The min and max of a sample are, if you assume normality, random variables whose statistics follow from mean/stddev/number of samples. So given the latter, one can compute (after slogging through the math or running a bunch of monte carlo scripts) a confidence interval for the former (like it is 80% probable that the stddev is between 20 and 40 or something like that).
That said, it probably isn't worth doing except in extreme situations.

Estimated size of the self-join operation on a relation R, given a histogram for R

Query optimizers typically use summaries of data distributions to estimate the sizes of the intermediate tables generated during query processing. One popular such summarization scheme is a histogram, whereby the input range is partitioned into buckets and a cumulative count is maintained of the number of tuples falling in each bucket. The distribution within a bucket is assumed to be uniform for the purposes of estimation.
The following shows one such histogram for a relation R on a discrete attribute a with domain [1..10]:
Bucket 1: range = [1..2] Cumulative tuple count = 6
Bucket 2: range = [3..8] Cumulative tuple count = 30
Bucket 3: range = [9..10] Cumulative tuple count = 10
What is the estimated size of the self-join operation R x R
A) 46
B) 218
C) 248
D) 1,036
E) 5,672
Answer given in solutions : B
How is the answer to be calculated?
The size of a self-join on attribute R is equal to the summation of the frequency of each value of attribute R.
Here the frequency is given in buckets, e.g. the first bucket has 2 values r with frequency = 6, so we can assume the frequency of each value in bucket one is frequency = 3, similarly for bucket two frequency of each = 30/6 = 5, and for bucket three frequency of each value = 10/2 = 5.
Therefore, the size is
Size = [(3^2)*2] + [(5^2)*6] + [(5^2)*2]
= 218
I've been trying to figure this one out myself (it's from the GRE Computer Science subject test preparation exam).
So far I haven't found an answer as to why the answer is 218, but I have found a connection between the numbers given and the correct answer.
It turns out that that sum of the square of the cumulative tuple counts divided by the number of discrete values in each bucket, you get 218. Less abstractly: 6²/2 + 30²/5 + 10²/2 = 218.
It's not an answer, but at least there's a connection =)

Resources