Fewer bit changes but same PSNR value - image-processing

We have proposed and simulated an LSB insertion method that makes 10% fewer bit changes than regular LSB embedding. However, to our surprise, the PSNR of the proposed method is very close to that of regular LSB (an increase of around 1%).
This is confusing: our method changes fewer LSBs, yet the PSNR is almost the same. We would appreciate any help.
We are using MATLAB and testing on an RGB image (embedding in the LSB of all three RGB channels).

I think your results are not unexpected and that you're probably overestimating the effect because PSNR is a logarithmic function. I will plug in some numbers to show you that effect, but I will make some assumptions about details you haven't mentioned.
I assume that you embed only 1 bit per pixel and that is in the least significant bit and not in the 2nd, 3rd or any other.
What this means is that if a bit flips, the squared error between the original and the modified pixel is 1. If we assume a stream of random bits, we expect to change half of the pixels in which we embed information. All of this can be encapsulated in the following two functions.
mse = @(bits, x, y) (0.5 * bits) / (x * y);
psnr = @(signal, noise) 10 * log10(signal^2 / noise);
Where bits is the total number of embedding bits, x and y the size of the image, signal is the maximum intensity value in the original image (I used 255) and noise is the mse.
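The "half of the pixels change" assumption behind the mse function can be checked empirically. The following is a minimal Python sketch, not the asker's MATLAB code: the function name lsb_embed_mse, the random cover image, and the choice to embed in the first n_bits pixels are all assumptions made for illustration.

```python
import random

def lsb_embed_mse(width, height, n_bits, seed=0):
    """Embed n_bits random bits in the LSB of the first n_bits pixels of a
    random 8-bit single-channel image and return the measured MSE."""
    rng = random.Random(seed)
    cover = [rng.randrange(256) for _ in range(width * height)]
    stego = list(cover)
    for i in range(n_bits):
        bit = rng.randrange(2)
        stego[i] = (stego[i] & ~1) | bit  # overwrite the least significant bit
    squared_error = sum((c - s) ** 2 for c, s in zip(cover, stego))
    return squared_error / (width * height)

measured = lsb_embed_mse(512, 512, 100_000)
expected = 0.5 * 100_000 / (512 * 512)  # the mse formula above
print(f"measured MSE = {measured:.5f}, expected MSE = {expected:.5f}")
```

The measured value should land close to the formula's prediction, because each embedded bit differs from the existing LSB about half of the time.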
Let's assume a 512x512 image. For simplicity, let's also calculate the PSNR for only one colour component.
5,000 bits: 68.3368 dB
4,500 bits: 68.7944 dB
PSNR increase: 0.6696%
100,000 bits: 55.3265 dB
90,000 bits: 55.7841 dB
PSNR increase: 0.8271%
In fact, you can derive a general symbolic form.
mse_orig = (bits / 2) / (x * y) = bits / (2 * x * y)
mse_steg = (0.9 * bits / 2) / (x * y) = 0.9 * mse_orig
psnr = 10 * log10(signal^2 / mse)
percent_change = 100 * (psnr_steg - psnr_orig) / psnr_orig
By using the property log(a / b) = log(a) - log(b), the above can be simplified to:
percent_change = 100 * (-log10(0.9) / log10(2* x * y * signal^2 / bits))
Where bits is the number of bits you embed with regular LSB embedding (i.e., 5,000 and NOT 4,500).
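To see the numbers and the symbolic form side by side, here is a short Python sketch of the same calculation (a translation of the two functions above; the helper name percent_change_closed_form is made up for this example). It assumes one colour component, a 512x512 image, and a peak signal of 255.

```python
import math

def mse(bits, x, y):
    # Half of the embedded bits are expected to flip an LSB (squared error 1).
    return 0.5 * bits / (x * y)

def psnr(signal, noise):
    return 10 * math.log10(signal ** 2 / noise)

def percent_change_closed_form(bits, x, y, signal, ratio=0.9):
    # ratio is the fraction of bit changes kept by the modified method.
    return 100 * (-math.log10(ratio) / math.log10(2 * x * y * signal ** 2 / bits))

x = y = 512
signal = 255
for bits in (5_000, 100_000):
    psnr_orig = psnr(signal, mse(bits, x, y))
    psnr_steg = psnr(signal, mse(0.9 * bits, x, y))
    direct = 100 * (psnr_steg - psnr_orig) / psnr_orig
    closed = percent_change_closed_form(bits, x, y, signal)
    print(f"{bits} bits: {psnr_orig:.4f} dB -> {psnr_steg:.4f} dB, "
          f"increase {direct:.4f}% (closed form {closed:.4f}%)")
```

The absolute gain is always 10 * log10(1 / 0.9), roughly 0.46 dB, whatever the payload; only the PSNR it is divided by changes.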


Where is OpenCV's getGaussianKernel method's formula for sigma coming from? [duplicate]

I find on the OpenCV documentation for cvSmooth that sigma can be calculated from the kernel size as follows:
sigma = 0.3(n/2 - 1) + 0.8
I would like to know the theoretical background of this equation.
Thank you.
Using such a value for sigma, the ratio between the value at the centre of the kernel and on the edge of the kernel, found for y=0 and x=n/2-1, is:
g_edge / g_center = exp(-(x²+y²)/(2σ²))
= exp(-(n/2-1)²/(2*(0.3(n/2-1)+0.8)²))
The limit of this value as n increases is:
exp(-1/(2*0.3²)) = 0.00386592
Note that 1/256 is 0.00390625. Images are often encoded in 256-value ranges. The choice of 0.3 ensures that the kernel considers all pixels that may significantly influence the resulting value.
I am afraid I do not have an explanation for the 0.8 part, but I imagine it is here to ensure reasonable values when n is small.
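The convergence towards that limit is easy to check numerically. A small Python sketch using only the formula quoted in the question (the function names are made up here):

```python
import math

def sigma_from_ksize(n):
    # Formula quoted in the question: sigma = 0.3(n/2 - 1) + 0.8
    return 0.3 * (n / 2 - 1) + 0.8

def edge_to_center_ratio(n):
    # Ratio g_edge / g_center for y = 0 and x = n/2 - 1
    x = n / 2 - 1
    sigma = sigma_from_ksize(n)
    return math.exp(-x ** 2 / (2 * sigma ** 2))

limit = math.exp(-1 / (2 * 0.3 ** 2))
for n in (3, 7, 15, 31, 101, 1001):
    print(n, edge_to_center_ratio(n))
print("limit:", limit, "1/256:", 1 / 256)
```

For small n the ratio is far above the limit, which is consistent with the guess that the 0.8 offset mainly matters for small kernels.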

How best to use Eigen to filter & resize an image?

I have an image, a 2D array of uint8_ts. I want to resize the image using a separable filter. Consider shrinking the width first. Because original & target sizes are unrelated, we'll use a different set of coefficients for every destination pixel. For a particular in & out size, for all y, we might have:
out(500, y) = in(673, y) * 12 + in(674, y) * 63 + in(675, y) * 25
out(501, y) = in(674, y) * 27 + in(675, y) * 58 + in(676, y) * 15
How can I use Eigen to speed this up, e.g. vectorize it for me?
This can be expressed as a matrix multiply with a sparse matrix, of dimension in_width * out_width, where in each row, only 3 out of the in_width values are non-zero. In the actual use case, 4 to 8 will typically be non-zero. But those non-zero values will be contiguous, and it would be nice to use SSE or whatever to make it fast.
Note that the source matrix has 8 bit scalars. The final result, after scaling both width & height will be 8 bits as well. It might be nice for the intermediate matrix (image) and filter to be higher precision, say 16 bits. But even if they're 8 bits, when multiplying, we'll need to take the most significant bits of the product, not the least significant.
Is this too far outside what Eigen can do? This sort of convolution, with a kernel that's different at every pixel (only because the output size isn't an integral multiple of the input size), seems common enough.
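To make the operation concrete, here is a NumPy sketch of the computation being asked about. It is not Eigen and not a tuned implementation; the function name resize_width, the starts/coeffs layout, and the weights summing to 128 (so that a 7-bit shift normalizes the result) are all assumptions for illustration.

```python
import numpy as np

def resize_width(img, starts, coeffs, shift):
    """Shrink the first axis of img with per-destination-pixel filters.

    img:    (in_width, height) uint8 array, indexed img[x, y] like in(x, y) above
    starts: (out_width,) index of the first source column used by each output column
    coeffs: (out_width, taps) non-negative integer weights; each row sums to 1 << shift
    """
    acc = np.zeros((len(starts), img.shape[1]), dtype=np.int32)
    for k in range(coeffs.shape[1]):
        # One vectorized multiply-add per tap, covering all output columns at once.
        acc += coeffs[:, k, None].astype(np.int32) * img[starts + k, :].astype(np.int32)
    # Round, then keep the most significant bits of the fixed-point product.
    return ((acc + (1 << (shift - 1))) >> shift).astype(np.uint8)

img = np.full((8, 4), 200, dtype=np.uint8)       # flat grey test image
starts = np.array([0, 2, 5])
coeffs = np.array([[16, 80, 32],
                   [35, 74, 19],
                   [32, 64, 32]])                # hypothetical weights, rows sum to 128
out = resize_width(img, starts, coeffs, shift=7)
print(out.shape, out[0])
```

Looping over the taps rather than over the destination pixels keeps each step a dense, contiguous multiply-add, which is the access pattern a vectorized C++ version would also want.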

Programmatically performing gradient calculation

Let y = Relu(Wx) where W is a 2d matrix representing a linear transformation on x, a vector. Likewise, let m = Zy, where Z is a 2d matrix representing a linear transformation on y. How do I programmatically calculate the gradient of Loss = sum(m^2) with respect to W, where the power means take the element wise power of the resulting vector, and sum means adding all the elements together?
I can work this out slowly mathematically by taking a hypothetical, multiplying it all out, then element-by-element taking the derivative to construct the gradient, but I can't figure out an efficient approach to write a program once the neural network layer becomes >1.
Say, for just one layer (m = Zy, take gradient wrt Z) I could just say
Loss = sum(m^2)
dLoss/dZ = 2m * y
where * is the outer product of the vectors, and I guess this is kind of like normal calculus and it works. Now for 2 layers + activation (gradient wrt W), if I try to do it like "normal" calculus and apply the chain rule I get:
dLoss/dW = 2m * Z * dRelu * x
where dRelu is the derivative of Relu(Wx) except here I have no idea what * means in this case to make it work.
Is there an easy way to calculate this gradient mathematically without basically multiplying it all out and deriving each separate element in the gradient? I'm really unfamiliar with matrix calculus, so if anyone could also give some mathematical intuition, if my attempt is completely wrong, that would be appreciated.
For the sake of convenience, let's ignore the ReLU for a moment. You have an input space X (of some size [dimX]) mapped to an intermediate space Y (of some size [dimY]) mapped to an output space M (of some size [dimM]). You have, then, W: X → Y, a matrix of shape [dimY, dimX], and Z: Y → M, a matrix of shape [dimM, dimY]. Finally, your loss is simply a function that maps your M space to a scalar value.
Let us walk the way backwards. As you correctly said, you want to compute the derivative of the loss w.r.t W and to do so you need to apply the chain rule all the way back. You then have:
dL/dW = dL/dm * dm/dY * dY/dW
dL/dm is of shape [dimM] (a scalar function with derivatives across dimM dimensions)
dm/dY is of shape [dimM, dimY] (a dimM-dimensional function with derivatives across dimY dimensions)
dY/dW is of shape [dimY, dimW] = [dimY, dimY, dimX] (a dimY-dimensional function with derivatives across [dimY, dimX] dimensions)
To make the last bit more clear, Y consists of dimY different values, so Y can be treated as dimY constituent functions. We need to apply the gradient operator on each of those mini-functions, all with respect to the basis vectors defined by W. More concretely, if W = [[w11, w12], [w21, w22], [w31, w32]] and x = [x1, x2], then Y = [y1, y2, y3] = [w11x1 + w12x2, w21x1 + w22x2, w31x1 + w32x2]. Then W defines a 6d space (3x2) across which we need to differentiate. We have dY/dW = [dy1/dW, dy2/dW, dy3/dW], and also dy1/dW = [[dy1/dw11, dy1/dw12], [dy1/dw21, dy1/dw22], [dy1/dw31, dy1/dw32]] = [[x1,x2],[0,0],[0,0]], a 3x2 matrix. So dY/dW is a [3,3,2] tensor.
As for the multiplication part: the operation here is tensor contraction (essentially matrix multiplication in higher-dimensional spaces). Practically, if you have a high-order tensor A[[a1, a2, a3... ], β] (i.e. a+1 dimensions, the last of which is of size β) and a tensor B[β, [b1, b2...]] (i.e. b+1 dimensions, the first of which is of size β), their tensor contraction is a tensor C[[a1,a2...], [b1,b2...]] (i.e. a+b dimensions, with the β dimension contracted), where C is obtained by multiplying element-wise and summing over the shared dimension β (refer to https://docs.scipy.org/doc/numpy/reference/generated/numpy.tensordot.html#numpy.tensordot).
The resulting tensor contraction then is a matrix of shape [dimY, dimX] which can be used to update your W weights. The ReLU which we ignored earlier can easily be thrown in the mix, since ReLU: 1 → 1 is a scalar function applied element-wise on Y.
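As a concrete illustration of that contraction, here is a NumPy sketch for the 3x2 example above. The ReLU is still ignored, and the values of x and dL/dY are arbitrary placeholders chosen for this example.

```python
import numpy as np

x = np.array([2.0, 5.0])            # x = [x1, x2]
dim_y, dim_x = 3, x.size

# dY/dW[i, j, k] = d y_i / d w_jk = (1 if i == j else 0) * x_k
dY_dW = np.einsum("ij,k->ijk", np.eye(dim_y), x)
print(dY_dW.shape)                  # (3, 3, 2)
print(dY_dW[0])                     # dy1/dW: only the first row holds x

# Placeholder for the upstream gradient dL/dY, of shape [dimY]
dL_dY = np.array([1.0, -2.0, 0.5])

# Contract the shared dimY axis: [dimY] with [dimY, dimY, dimX] -> [dimY, dimX]
dL_dW = np.tensordot(dL_dY, dY_dW, axes=1)
print(dL_dW)
```

Because dY/dW is mostly zeros, the contraction collapses to the outer product of dL/dY and x, which is why practical code rarely builds the full [dimY, dimY, dimX] tensor.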
To summarize, taking the derivative of ReLU to be the step function (1 where Wx > 0, else 0), your code would be:
W_gradient = np.outer(np.dot(Z.T, 2 * m) * (np.dot(W, x) > 0), x)
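A hand-derived gradient like this is worth checking against finite differences. The following NumPy sketch does that for Loss = sum(m^2) with m = Z·ReLU(Wx); the helper names, the random test shapes, and the choice of the step function (h > 0) as the ReLU derivative are assumptions of this example.

```python
import numpy as np

def loss_fn(W, Z, x):
    y = np.maximum(W @ x, 0.0)          # y = ReLU(Wx)
    m = Z @ y
    return np.sum(m ** 2)

def grad_W(W, Z, x):
    h = W @ x                            # pre-activation, shape [dimY]
    m = Z @ np.maximum(h, 0.0)           # shape [dimM]
    dL_dy = Z.T @ (2.0 * m)              # matrix product, shape [dimY]
    dL_dh = dL_dy * (h > 0)              # element-wise ReLU derivative
    return np.outer(dL_dh, x)            # shape [dimY, dimX], same as W

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
Z = rng.normal(size=(4, 3))
x = rng.normal(size=2)

analytic = grad_W(W, Z, x)
numeric = np.zeros_like(W)
eps = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        step = np.zeros_like(W)
        step[i, j] = eps
        # Central difference on a single weight
        numeric[i, j] = (loss_fn(W + step, Z, x) - loss_fn(W - step, Z, x)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

The check can misbehave if some entry of Wx sits exactly at the ReLU kink, so it is best run on random inputs as done here.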
I just implemented several multilayer perceptron (MLP) neural networks from scratch in C++[1], and I think I know your pain. Believe me, you don't need any third-party matrix/tensor/automatic differentiation (AD) library to do the matrix multiplication or the gradient calculation. There are three things you should pay attention to:
There are two kinds of multiplication in the equations: matrix multiplication and element-wise multiplication. You'll mess up if you denote both with a single *.
Use concrete examples, especially concrete numbers as dimensions of your data/matrix/vector, to build intuition.
The most powerful tool for getting the program right is dimension compatibility: always check the dimensions.
Suppose you want to do binary classification and the neural network is input -> h1 -> sigmoid -> h2 -> sigmoid -> loss, in which the input layer has 1 sample with 2 features, h1 has 7 neurons, and h2 has 2 neurons. Then:
forward pass:
Z1(1, 7) = X(1, 2) * W1(2, 7)
A1(1, 7) = sigmoid(Z1(1, 7))
Z2(1, 2) = A1(1, 7) * W2(7, 2)
A2(1, 2) = sigmoid(Z2(1, 2))
Loss = 1/2(A2 - label)^2
backward pass:
dA2(1, 2) = dL/dA2 = A2 - label
dZ2(1, 2) = dL/dZ2 = dA2 * dsigmoid(A2_i) -- element wise
dW2(7, 2) = A1(1, 7).T * dZ2(1, 2) -- matrix multiplication
Notice in the last equation that the dimension of the gradient of W2 must match W2, which is (7, 2). The only way to get a (7, 2) matrix is to transpose the input A1 and multiply it by dZ2; that's dimension compatibility[2].
backward pass continued:
dA1(1, 7) = dZ2(1, 2) * W2.T(2, 7) -- matrix multiplication
dZ1(1, 7) = dA1(1, 7) * dsigmoid(A1_i) -- element wise
dW1(2, 7) = X.T(2, 1) * dZ1(1, 7) -- matrix multiplication
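The forward and backward passes can be written almost line for line in NumPy, with @ for matrix multiplication and * for element-wise multiplication. This is a sketch with random weights and a made-up label, not the C++ code mentioned in [1]; dA1 is computed with W2 transposed, which is what the (2, 7) shape requires.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, W2):
    A1 = sigmoid(X @ W1)             # (1, 2) @ (2, 7) -> (1, 7)
    A2 = sigmoid(A1 @ W2)            # (1, 7) @ (7, 2) -> (1, 2)
    return A1, A2

def backward(X, label, W1, W2):
    A1, A2 = forward(X, W1, W2)
    dA2 = A2 - label                 # (1, 2), from Loss = 1/2 (A2 - label)^2
    dZ2 = dA2 * A2 * (1.0 - A2)      # element-wise sigmoid derivative
    dW2 = A1.T @ dZ2                 # (7, 1) @ (1, 2) -> (7, 2)
    dA1 = dZ2 @ W2.T                 # (1, 2) @ (2, 7) -> (1, 7)
    dZ1 = dA1 * A1 * (1.0 - A1)      # element-wise sigmoid derivative
    dW1 = X.T @ dZ1                  # (2, 1) @ (1, 7) -> (2, 7)
    return dW1, dW2

rng = np.random.default_rng(0)
X = rng.normal(size=(1, 2))          # 1 sample, 2 features
label = np.array([[0.0, 1.0]])       # made-up target
W1 = rng.normal(size=(2, 7))
W2 = rng.normal(size=(7, 2))

dW1, dW2 = backward(X, label, W1, W2)
print(dW1.shape, dW2.shape)          # (2, 7) (7, 2)
```

Each gradient has the same shape as the weight matrix it updates, which is the dimension check described above.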
[1] The code is here, you can see the hidden layer implementation, naive matrix implementation and the references listed there.
[2] I omit the matrix derivation part, it's simple actually but hard to type the equations out. I strongly suggest you read this paper, every tiny detail you should know on matrix derivation in DL is listed in this paper.
[3] The example above uses a single sample as input (a vector); you can replace 1 with any batch size (making it a matrix), and the equations still hold.

Explanation of rho and theta parameters in HoughLines

Can you give me a quick definition of the rho and theta parameters in OpenCV's HoughLines function?
void cv::HoughLines ( InputArray image,
OutputArray lines,
double rho,
double theta,
int threshold,
double srn = 0,
double stn = 0,
double min_theta = 0,
double max_theta = CV_PI
)
The only thing I found in the doc is:
rho: Distance resolution of the accumulator in pixels.
theta: Angle resolution of the accumulator in radians.
Does this mean that if I set rho=2, then half of my image's pixels will be ignored, a kind of stride=2?
I have searched for this for hours and still haven't found a place where it is neatly explained. But picking up the pieces, I think I got it.
The algorithm goes over every edge pixel (result of Canny, for example) and calculates ρ using the equation ρ = x * cosθ + y * sinθ, for many values of θ.
The actual step of θ is defined by the function parameter, so if you use the usual math.pi / 180.0 value of theta, the algorithm will compute ρ 180 times in total for just one edge pixel in the image. If you would use a larger theta, there would be fewer calculations, fewer accumulator columns/buckets and therefore fewer lines found.
The other parameter ρ defines how "fat" a row of the accumulator is. With a value of 1, you are saying that you want the number of accumulator rows to be equal to the biggest ρ possible, which is the diagonal of the image you're processing. So if for some two values of θ you get close values for ρ, they will still go into separate accumulator buckets because you are going for precision. For a larger value of the parameter rho, those two values might end up in the same bucket, which will ultimately give you more lines because more buckets will have a large vote count and therefore exceed the threshold.
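The voting scheme described above can be sketched in a few lines of pure Python. This is a simplified illustration, not OpenCV's actual implementation: the function name hough_accumulator is made up, and the sketch lets ρ be negative, so it allocates rows for the whole range from -diagonal to +diagonal.

```python
import math

def hough_accumulator(points, width, height, rho=1.0, theta=math.pi / 180):
    """Vote every edge point into a (rho, theta) accumulator."""
    max_rho = math.hypot(width, height)              # image diagonal
    n_theta = int(round(math.pi / theta))            # accumulator columns
    n_rho = int(round(2 * max_rho / rho)) + 1        # accumulator rows
    acc = [[0] * n_theta for _ in range(n_rho)]
    for x, y in points:
        for t in range(n_theta):
            angle = t * theta
            r = x * math.cos(angle) + y * math.sin(angle)
            row = int(round((r + max_rho) / rho))    # larger rho: fatter rows
            acc[row][t] += 1
    return acc

# 20 edge points on the horizontal line y = 5
points = [(x, 5) for x in range(20)]
fine = hough_accumulator(points, 20, 20, rho=1.0)
coarse = hough_accumulator(points, 20, 20, rho=2.0)

print(len(fine), len(coarse))                        # fewer rows with rho=2
print(sum(map(sum, fine)), sum(map(sum, coarse)))    # same number of votes
```

In this sketch a larger rho does not skip any pixels: every edge point still votes once per θ step, and only the number of buckets the votes are spread over shrinks.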
To detect lines with the Hough transform, the best way is to represent each line by an equation with two parameters, rho and theta. The equation is the following:
x cos(θ) + y sin(θ) = ρ
where (x, y) are the coordinates of a point on the line and (θ, ρ) are the line parameters.
Writing the line in (θ, ρ) parameters makes the detection less dependent on position than writing it as y = a*x + b.
In this context, the theta and rho arguments give the discretization steps for these two parameters.

