SSE: shuffle (permutevar) 4x32 integers

I have some code using the AVX2 intrinsic _mm256_permutevar8x32_epi32 aka vpermd to select integers from an input vector by an index vector. Now I need the same thing but for 4x32 instead of 8x32. _mm_permutevar_ps does it for floating point, but I'm using integers.
One idea is _mm_shuffle_epi32, but I'd first need to convert my 4x32 index values to a single integer, that is:
imm[1:0] := idx[31:0]
imm[3:2] := idx[63:32]
imm[5:4] := idx[95:64]
imm[7:6] := idx[127:96]
I'm not sure what's the best way to do that, and moreover I'm not sure it's the best way to proceed. I'm looking for the most efficient method on Broadwell/Haswell to emulate the "missing" _mm_permutevar_epi32(__m128i a, __m128i idx). I'd rather use 128-bit instructions than 256-bit ones if possible (i.e. I don't want to widen the 128-bit inputs then narrow the result).

It's useless to generate an immediate at run-time, unless you're JITing new code. An immediate is a byte that's literally part of the machine-code instruction encoding. That's great if you have a compile-time-constant shuffle (after inlining + template expansion); otherwise forget about the shuffles that take their control operand as an integer (see footnote 1).
Before AVX, the only variable-control shuffle was SSSE3 pshufb (_mm_shuffle_epi8). That's still the only variable-control 128-bit (or in-lane) integer shuffle instruction in AVX2, and I think in AVX512.
AVX1 added some in-lane 32-bit variable shuffles, like vpermilps (_mm_permutevar_ps). AVX2 added lane-crossing integer and FP shuffles, but somewhat strangely no 128-bit version of vpermd. Perhaps because Intel microarchitectures have no penalty for using FP shuffles on integer data. (Which is true on Sandybridge family, I just don't know if that was part of the reasoning for the ISA design). But you'd think they would have added __m128i intrinsics for vpermilps if that's what you were "supposed" to do. Or maybe the compiler / intrinsics design people didn't agree with the asm instruction-set people?
If you have a runtime-variable vector of 32-bit indices and want to do a shuffle with 32-bit granularity, by far your best bet is to just use AVX _mm_permutevar_ps.
_mm_castps_si128( _mm_permutevar_ps (_mm_castsi128_ps(a), idx) )
On Intel at least, it won't even introduce any extra bypass latency when used between integer instructions like paddd; i.e. FP shuffles specifically (not blends) have no penalty for use on integer data in Sandybridge-family CPUs.
If there's any penalty on AMD Bulldozer or Ryzen, it's minor and definitely cheaper than the cost of calculating a shuffle-control vector for (v)pshufb.
Using vpermd ymm and ignoring the upper 128 bits of input and output (i.e. by using cast intrinsics) would be much slower on AMD (because its 128-bit SIMD design has to split lane-crossing 256-bit shuffles into several uops), and also worse on Intel where it makes it 3c latency instead of 1 cycle.
@Iwill's answer shows a way to calculate a shuffle-control vector of byte indices for pshufb from a vector of 4x32-bit dword indices. But it uses SSE4.1 pmulld which is 2 uops on most CPUs, and could easily be a worse bottleneck than shuffles. (See discussion in comments under that answer.) Especially on older CPUs without AVX, some of which can do 2 pshufb per clock, unlike modern Intel (Haswell and later only have 1 shuffle port and easily bottleneck on shuffles; Ice Lake will add another shuffle port, according to Intel's Sunny Cove presentation.)
If you do have to write an SSSE3 or SSE4.1 version of this, it's probably best to stick to SSSE3: use pshufb to duplicate the low byte of each index within its dword, plus a left shift to multiply by 4, before ORing the byte offsets 0,1,2,3 into the low bits; don't use pmulld. SSE4.1 pmulld is multiple uops and even worse than pshufb on some CPUs with slow pshufb. (You might not benefit from vectorizing at all on CPUs with only SSSE3 and not SSE4.1, i.e. first-gen Core 2, because it has slow-ish pshufb.)
On 2nd-gen Core2, and Goldmont, pshufb is a single-uop instruction with 1-cycle latency. On Silvermont and first-gen Core 2 it's not so good. But overall I'd recommend pshufb + pslld + por to calculate a control-vector for another pshufb if AVX isn't available.
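As a concrete illustration, here's a minimal sketch of that pshufb + pslld + por recipe, assuming the dword indices are already in the 0..3 range (mask with _mm_set1_epi32(3) first if they might not be); the function name is just for illustration:

#include <tmmintrin.h>  // SSSE3

__m128i vperm_ssse3(__m128i a, __m128i idx) {
    // Broadcast the low byte of each dword index into all 4 bytes of that dword.
    const __m128i bcast = _mm_set_epi8(12,12,12,12, 8,8,8,8, 4,4,4,4, 0,0,0,0);
    __m128i ctrl = _mm_shuffle_epi8(idx, bcast);
    // Multiply each byte by 4 via a dword shift: no carry into the next byte since 4*3 = 12.
    ctrl = _mm_slli_epi32(ctrl, 2);
    // OR in the byte offsets 0,1,2,3 within each dword.
    ctrl = _mm_or_si128(ctrl, _mm_set1_epi32(0x03020100));
    return _mm_shuffle_epi8(a, ctrl);
}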
An extra shuffle to prepare for a shuffle is far worse than just using vpermilps on any CPU that supports AVX.
Footnote 1:
You'd have to use a switch or something to select a code path with the right compile-time-constant integer, and that's horrible; only consider that if you don't even have SSSE3 available. It may be worse than scalar unless the jump-table branch predicts perfectly.

Although Peter Cordes is correct in saying that the AVX instruction vpermilps and its intrinsic _mm_permutevar_ps() will probably do the job, if you're working on machines older than Sandy Bridge, an SSE4.1 variant using pshufb works quite well too.
AVX variant
Credits to @PeterCordes
#include <stdio.h>
#include <immintrin.h>
__m128i vperm(__m128i a, __m128i idx){
    return _mm_castps_si128(_mm_permutevar_ps(_mm_castsi128_ps(a), idx));
}

int main(int argc, char* argv[]){
    __m128i a   = _mm_set_epi32(0xDEAD, 0xBEEF, 0xCAFE, 0x0000);
    __m128i idx = _mm_set_epi32(1, 0, 3, 2);
    __m128i shu = vperm(a, idx);
    printf("%04x %04x %04x %04x\n", ((unsigned*)(&shu))[3],
                                    ((unsigned*)(&shu))[2],
                                    ((unsigned*)(&shu))[1],
                                    ((unsigned*)(&shu))[0]);
    return 0;
}
SSE4.1 variant
#include <stdio.h>
#include <immintrin.h>
__m128i vperm(__m128i a, __m128i idx){
    idx = _mm_and_si128  (idx, _mm_set1_epi32(0x00000003));
    idx = _mm_mullo_epi32(idx, _mm_set1_epi32(0x04040404));
    idx = _mm_or_si128   (idx, _mm_set1_epi32(0x03020100));
    return _mm_shuffle_epi8(a, idx);
}

int main(int argc, char* argv[]){
    __m128i a   = _mm_set_epi32(0xDEAD, 0xBEEF, 0xCAFE, 0x0000);
    __m128i idx = _mm_set_epi32(1, 0, 3, 2);
    __m128i shu = vperm(a, idx);
    printf("%04x %04x %04x %04x\n", ((unsigned*)(&shu))[3],
                                    ((unsigned*)(&shu))[2],
                                    ((unsigned*)(&shu))[1],
                                    ((unsigned*)(&shu))[0]);
    return 0;
}
This compiles down to the crisp:
0000000000400550 <vperm>:
400550: c5 f1 db 0d b8 00 00 00 vpand 0xb8(%rip),%xmm1,%xmm1 # 400610 <_IO_stdin_used+0x20>
400558: c4 e2 71 40 0d bf 00 00 00 vpmulld 0xbf(%rip),%xmm1,%xmm1 # 400620 <_IO_stdin_used+0x30>
400561: c5 f1 eb 0d c7 00 00 00 vpor 0xc7(%rip),%xmm1,%xmm1 # 400630 <_IO_stdin_used+0x40>
400569: c4 e2 79 00 c1 vpshufb %xmm1,%xmm0,%xmm0
40056e: c3 retq
The AND-masking is optional if you can guarantee that the control indices will always be the 32-bit integers 0, 1, 2 or 3.

Related

SSE divrem memory store requirements

I'm searching for information on the divrem intrinsic sequences and their memory requirements (for the store).
These intrinsics, specifically (check the SSE and SVML categories in the Intel intrinsics guide):
__m128i _mm_idivrem_epi32 (__m128i * mem_addr, __m128i a, __m128i b)
__m256i _mm256_idivrem_epi32 (__m256i * mem_addr, __m256i a, __m256i b)
__m128i _mm_udivrem_epi32 (__m128i * mem_addr, __m128i a, __m128i b)
__m256i _mm256_udivrem_epi32 (__m256i * mem_addr, __m256i a, __m256i b)
The Intel intrinsics guide states:
Divide packed 32-bit integers in a by packed elements in b, store the
truncated results in dst, and store the remainders as packed 32-bit
integers into memory at mem_addr.
FOR j := 0 to 3
i := 32*j
dst[i+31:i] := TRUNCATE(a[i+31:i] / b[i+31:i])
MEM[mem_addr+i+31:mem_addr+i] := REMAINDER(a[i+31:i] / b[i+31:i])
ENDFOR
dst[MAX:128] := 0
Does this mean, mem_addr is expected to be aligned (as per store), unaligned (storeu), or is it supposed to be a single register output (__m128i on the stack)?
alignof(__m256i) == 32 (and alignof(__m128i) == 16), so for portability to any other compilers that might implement this intrinsic (like clang-based ICX), you should point it at suitably aligned memory, or just at a __m128i / __m256i temporary, and then use a normal store intrinsic (store or storeu) to tell the compiler where you want the result to go.
As Homer512 points out with an example in https://godbolt.org/z/9szzjEo7c , ICC stores it with movdqu. But we can see it always uses unaligned loads/stores, also for deref of __m128i* pointers for inputs. GCC and clang do use alignment-required loads/stores when you promise them alignment (e.g. by deref of a __m128i*).
The actual SVML function (called through QWORD PTR [__svml_idivrem4@GOTPCREL+rip]) returns in XMM0 and XMM1; the by-reference output operand is fortunately an invention of the intrinsics API. So passing the address of a __m128i tmp and then storing the temporary wherever you want will fully optimize away.
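A minimal sketch of that pattern, assuming a compiler that provides the SVML intrinsic (ICC / ICX); rem_out is a hypothetical destination pointer:

#include <immintrin.h>
#include <stdint.h>

__m128i div_and_store_rem(__m128i a, __m128i b, int32_t *rem_out) {
    __m128i rem_tmp;                                   // local temporary, naturally 16-byte aligned
    __m128i quot = _mm_idivrem_epi32(&rem_tmp, a, b);  // remainders written through the pointer
    _mm_storeu_si128((__m128i*)rem_out, rem_tmp);      // explicit store, aligned or not, where you want it
    return quot;                                       // truncated quotients
}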

SSE Shift Instruction zeroes the vector with _mm_set1_epi32() for the count vector?

Here's the situation: m3 = _mm_srli_epi32(m2, 23); does exactly what is expected,
m3 = _mm_srl_epi32(m2, shift); however, with shift initialized as __m128i shift = _mm_set1_epi32(23);, it yields zero.
I've checked and shift does have the value it should have. Is there something simple I may be missing?
_mm_srl_epi32 (__m128i a, __m128i count) takes the shift count from the low 64 bits of the count vector. _mm_set1_epi32(23) puts 23 in every 32-bit element, so the low 64 bits hold (23<<32) | 23, a huge count which shifts out all the bits.
SSE shifts saturate the count (unlike scalar shifts which mask the count).
You want _mm_cvtsi32_si128(int) to zero-extend a single int into a __m128i, or if your shift count is already in a vector you need to isolate it in the low 64 bits of a vector with an AND, shuffle, or whatever.
movq xmm,xmm can zero-extend a 64-bit element to 128, but there's no equivalent for 32-bit elements.
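A minimal sketch of both fixes, assuming SSE2 (the function names are just for illustration):

#include <emmintrin.h>

// Shift count from a scalar int: zero-extended into the low 64 bits of a vector.
__m128i srl32_by_scalar(__m128i v, int count) {
    return _mm_srl_epi32(v, _mm_cvtsi32_si128(count));
}

// Shift count already in element 0 of a vector: zero the rest of the low 64 bits first.
__m128i srl32_by_vector(__m128i v, __m128i counts) {
    __m128i c = _mm_and_si128(counts, _mm_setr_epi32(-1, 0, 0, 0));  // isolate element 0
    return _mm_srl_epi32(v, c);
}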

SSE - Non-Existant haddsub intrinsic?

While crawling through the available intrinsics, I've noticed there is no horizontal addsub/subadd instruction anywhere. It's available in the obsolete 3DNow! extension, but its use is impractical for obvious reasons. What is the reason such a "basic" operation was not implemented in the SSE3 extension along with the similar horizontal-add and addsub operations?
And by the way, what's the fastest alternative on a modern instruction set (SSE3, SSE4, AVX, ...), with 2 doubles per vector, i.e. __m128d?
Generally you want to avoid designing your code to use horizontal ops in the first place; try to do the same thing to multiple data in parallel, instead of different things with different elements. But sometimes a local optimization is still worth it, and horizontal stuff can be better than pure scalar.
Intel experimented with adding horizontal ops in SSE3, but never added dedicated hardware to support them. They decode to 2 shuffles + 1 vertical op on all CPUs that support them (including AMD). See Agner Fog's instruction tables. More recent ISA extensions have mostly not included more horizontal ops, except for SSE4.1 dpps/dppd (which is also usually not worth using vs. manually shuffling).
SSSE3 pmaddubsw makes sense because element-width is already a problem for widening multiplication, and SSE4.1 phminposuw got dedicated HW support right away to make it worth using (and doing the same thing without it would cost a lot of uops, and it's specifically very useful for video encoding). But AVX / AVX2 / AVX512 horizontal ops are very scarce. AVX512 did introduce some nice shuffles, so you can build your own horizontal ops out of the powerful 2-input lane-crossing shuffles if needed.
If the most efficient solution to your problem already includes shuffling two inputs together in two different ways and feeding that to an add or sub, then sure, haddpd is an efficient way to encode that; especially without AVX, where preparing the inputs might have required a movaps instruction as well because shufpd is destructive (the compiler emits that movaps silently when using intrinsics, but it still costs front-end bandwidth, and latency on CPUs like Sandybridge and earlier which don't eliminate reg-reg moves).
But if you were going to use the same input twice, haddpd is the wrong choice. See also Fastest way to do horizontal float vector sum on x86. hadd / hsub are only a good idea with two different inputs, e.g. as part of an on-the-fly transpose as part of some other operation on a matrix.
Anyway, the point is, build your own haddsub_pd if you want it, out of two shuffles + SSE3 addsubpd (which does have single-uop hardware support on CPUs that support it.) With AVX, it will be just as fast as a hypothetical haddsubpd instruction, and without AVX will typically cost one extra movaps because the compiler needs to preserve both inputs to the first shuffle. (Code-size will be bigger, but I'm talking about cost in uops for the front-end, and execution-port pressure for the back-end.)
// Requires SSE3 (for addsubpd)
// inputs: a=[a1 a0] b=[b1 b0]
// output: [b1+b0, a1-a0], like haddpd for b and hsubpd for a
static inline
__m128d haddsub_pd(__m128d a, __m128d b) {
    __m128d lows  = _mm_unpacklo_pd(a,b);  // [b0, a0]
    __m128d highs = _mm_unpackhi_pd(a,b);  // [b1, a1]
    return _mm_addsub_pd(highs, lows);     // [b1+b0, a1-a0]
}
With gcc -msse3 and clang (on Godbolt) we get the expected:
movapd xmm2, xmm0 # ICC saves a code byte here with movaps, but gcc/clang use movapd on double vectors for no advantage on any CPU.
unpckhpd xmm0, xmm1
unpcklpd xmm2, xmm1
addsubpd xmm0, xmm2
ret
This wouldn't typically matter when inlining, but as a stand-alone function gcc and clang have trouble when they need the return value in the same register that b starts in, instead of a. (e.g. if the args are reversed so it's haddsub(b,a)).
# gcc for haddsub_pd_reverseargs(__m128d b, __m128d a)
movapd xmm2, xmm1 # copy b
unpckhpd xmm1, xmm0
unpcklpd xmm2, xmm0
movapd xmm0, xmm1 # extra copy to put the result in the right register
addsubpd xmm0, xmm2
ret
clang actually does a better job, using a different shuffle (movhlps instead of unpckhpd) to still only use one register-copy:
# clang5.0
movapd xmm2, xmm1 # clang's asm comments describe vectors in least-significant-element-first order, unlike my source comments which follow Intel's convention in docs / diagrams / set_pd() args order
unpcklpd xmm2, xmm0 # xmm2 = xmm2[0],xmm0[0]
movhlps xmm0, xmm1 # xmm0 = xmm1[1],xmm0[1]
addsubpd xmm0, xmm2
ret
For an AVX version with __m256d vectors, the in-lane behaviour of _mm256_unpacklo/hi_pd is actually what you want, for once, to get the even / odd elements.
static inline
__m256d haddsub256_pd(__m256d b, __m256d a) {
    __m256d lows  = _mm256_unpacklo_pd(a,b); // [b2, a2 | b0, a0]
    __m256d highs = _mm256_unpackhi_pd(a,b); // [b3, a3 | b1, a1]
    return _mm256_addsub_pd(highs, lows);    // [b3+b2, a3-a2 | b1+b0, a1-a0]
}
# clang and gcc both have an easy time avoiding wasted mov instructions
vunpcklpd ymm2, ymm1, ymm0 # ymm2 = ymm1[0],ymm0[0],ymm1[2],ymm0[2]
vunpckhpd ymm0, ymm1, ymm0 # ymm0 = ymm1[1],ymm0[1],ymm1[3],ymm0[3]
vaddsubpd ymm0, ymm0, ymm2
Of course, if you have the same input twice, i.e. you want the sum and difference between the two elements of a vector, you only need one shuffle to feed addsubpd:
// returns [a1+a0 a1-a0]
static inline
__m128d sumdiff(__m128d a) {
    __m128d swapped = _mm_shuffle_pd(a,a, 0b01);
    return _mm_addsub_pd(swapped, a);
}
This actually compiles quite clunkily with both gcc and clang:
movapd xmm1, xmm0
shufpd xmm1, xmm0, 1
addsubpd xmm1, xmm0
movapd xmm0, xmm1
ret
But the 2nd movapd should go away when inlining, if the compiler doesn't need the result in the same register it started with. I think gcc and clang are both missing an optimization here: they could swap xmm0 after copying it:
# compilers should do this, but don't
movapd xmm1, xmm0 # a = xmm1 now
shufpd xmm0, xmm0, 1 # swapped = xmm0
addsubpd xmm0, xmm1 # swapped +- a
ret
Presumably their SSA-based register allocators don't think of using a 2nd register for the same value of a to free up xmm0 for swapped. Usually it's fine (and even preferable) to produce the result in a different register, so this is rarely a problem when inlining, only when looking at the stand-alone version of a function.
How about:
__m128d a, b; // your inputs
const __m128d signflip_low_element =
    _mm_castsi128_pd(_mm_set_epi64x(0, 0x8000000000000000));
b = _mm_xor_pd(b, signflip_low_element);  // negate b[0]
__m128d res = _mm_hadd_pd(a, b);
This builds haddsubpd in terms of haddpd, so it's only one extra instruction. Unfortunately haddpd is not very fast, with a throughput of one per 2 clock cycles on most CPUs, limited by FP shuffle throughput.
But this way is good for code-size (of the x86 machine code).

how to deinterleave image channel in SSE

Is there any way to de-interleave the channels of a 32bpp image, similar to the NEON code below?
// Read all r,g,b,a pixels into 4 registers
uint8x8x4_t SrcPixels8x8x4 = vld4_u8(inPixel32);
ChannelR1_32x4 = vmovl_u16(vget_low_u16(vmovl_u8(SrcPixels8x8x4.val[0])));
ChannelR2_32x4 = vmovl_u16(vget_high_u16(vmovl_u8(SrcPixels8x8x4.val[0])));
Basically I want all color channels in separate vectors, with each vector holding 4 elements of 32 bits, so I can do some calculations. I'm not very familiar with SSE and couldn't find such an instruction there; can someone suggest a better way to do this? Any help is highly appreciated.
Since the 8 bit values are unsigned you can just do this with shifting and masking, much like you would for scalar code, e.g.
__m128i vrgba;
__m128i vr = _mm_and_si128(vrgba, _mm_set1_epi32(0xff));
__m128i vg = _mm_and_si128(_mm_srli_epi32(vrgba, 8), _mm_set1_epi32(0xff));
__m128i vb = _mm_and_si128(_mm_srli_epi32(vrgba, 16), _mm_set1_epi32(0xff));
__m128i va = _mm_srli_epi32(vrgba, 24);
Note that I'm assuming your RGBA elements have the R component in the LS 8 bits and the A component in the MS 8 bits, but if they are the opposite endianness you can just change the names of the vr/vg/vb/va vectors.
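For completeness, a minimal sketch of the loading side, assuming SSE2 and a pointer to at least 4 packed RGBA pixels (R in the least-significant byte); the function name is just for illustration:

#include <emmintrin.h>
#include <stdint.h>

static inline void deinterleave_rgba4(const uint32_t *inPixel32,
                                      __m128i *vr, __m128i *vg,
                                      __m128i *vb, __m128i *va)
{
    __m128i vrgba = _mm_loadu_si128((const __m128i*)inPixel32);  // 4 RGBA pixels
    *vr = _mm_and_si128(vrgba, _mm_set1_epi32(0xff));
    *vg = _mm_and_si128(_mm_srli_epi32(vrgba, 8),  _mm_set1_epi32(0xff));
    *vb = _mm_and_si128(_mm_srli_epi32(vrgba, 16), _mm_set1_epi32(0xff));
    *va = _mm_srli_epi32(vrgba, 24);
}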

linear transformation function

I need to write a function that takes 4 bytes as input, performs a reversible linear transformation on this, and returns it as 4 bytes.
But wait, there is more: it also has to be distributive, so changing one byte on the input should affect all 4 output bytes.
The issues:
if I use multiplication it won't be reversible after it is reduced mod 256 by the storage as a byte (and it needs to stay as a byte)
if I use addition it can't be both reversible and distributive
One solution:
I could create an array of bytes 256^4 long and fill it in as a one-to-one mapping. This would work, but there are issues: it means I have to search a graph of size 256^8, because I have to search for free numbers for every value (I should note the distributivity should be pseudo-random, based on a 64*64 array of bytes). This solution also has the MINOR (lol) issue of needing 8GB of RAM, making it nonsense.
The domain of the input is the same as the domain of the output, and every input has a unique output; in other words, a one-to-one mapping. As I noted under "one solution", this is very possible and I have used that method when a smaller domain (just 256) was in question. The fact is, as the numbers get big that method becomes extraordinarily inefficient: the delta was O(n^5) and the omega was O(n^8), with similar crappiness in memory usage.
I was wondering if there is a clever way to do it. In a nutshell, it's a one-to-one mapping of the domain (4 bytes, or 256^4). Oh, and such simple things as N+1 can't be used; it has to be keyed off a 64*64 array of byte values that are pseudo-random but recreatable for reverse transformations.
Balanced Block Mixers are exactly what you're looking for.
Who knew?
Edit! It is not possible, if you indeed want a linear transformation. Here's the mathy solution:
You've got four bytes, a_1, a_2, a_3, a_4, which we'll think of as a vector a with 4 components, each of which is a number mod 256. A linear transformation is just a 4x4 matrix M whose elements are also numbers mod 256. You have two conditions:
From Ma, we can deduce a (this means that M is an invertible matrix).
If a and a' differ in a single coordinate, then Ma and Ma' must differ in every coordinate.
Condition (2) is a little trickier, but here's what it means. Since M is a linear transformation, we know that
M(a - a') = Ma - Ma'
On the left, since a and a' differ in a single coordinate, a - a' has exactly one nonzero coordinate. On the right, since Ma and Ma' must differ in every coordinate, Ma - Ma' must have every coordinate nonzero.
So the matrix M must take a vector with a single nonzero coordinate to one with all nonzero coordinates. So we just need every entry of M to be a non-zero-divisor mod 256, i.e., to be odd.
Going back to condition (1), what does it mean for M to be invertible? Since we're considering it mod 256, we just need its determinant to be invertible mod 256; that is, its determinant must be odd.
So you need a 4x4 matrix with odd entries mod 256 whose determinant is odd. But this is impossible! Why? The determinant is computed by summing various products of entries. For a 4x4 matrix, there are 4! = 24 different summands, and each one, being a product of odd entries, is odd. But the sum of 24 odd numbers is even, so the determinant of such a matrix must be even!
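For reference, here's that parity argument written out as a formula (Leibniz expansion of the determinant); every summand is a product of four odd entries, hence odd, and there are 4! = 24 of them:

\det M \;=\; \sum_{\sigma \in S_4} \operatorname{sgn}(\sigma)\, M_{1\sigma(1)} M_{2\sigma(2)} M_{3\sigma(3)} M_{4\sigma(4)} \;\equiv\; \underbrace{1 + 1 + \cdots + 1}_{24\ \text{terms}} \;\equiv\; 0 \pmod 2

so det(M) is even, and therefore M is not invertible mod 256.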
Here are your requirements as I understand them:
Let B be the space of bytes. You want a one-to-one (and thus onto) function f: B^4 -> B^4.
If you change any single input byte, then all output bytes change.
Here's the simplest solution I have thus far. I have avoided posting for a while because I kept trying to come up with a better solution, but I haven't thought of anything.
Okay, first of all, we need a function g: B -> B which takes a single byte and returns a single byte. This function must have two properties: g(x) is reversible, and x^g(x) is reversible. [Note: ^ is the XOR operator.] Any such g will do, but I will define a specific one later.
Given such a g, we define f by f(a,b,c,d) = (a^b^c^d, g(a)^b^c^d, a^g(b)^c^d, a^b^g(c)^d). Let's check your requirements:
Reversible: yes. If we XOR the first two output bytes, we get a^g(a), but by the second property of g, we can recover a. Similarly for the b and c. We can recover d after getting a,b, and c by XORing the first byte with (a^b^c).
Distributive: yes. Suppose b,c, and d are fixed. Then the function takes the form f(a,b,c,d) = (a^const, g(a)^const, a^const, a^const). If a changes, then so will a^const; similarly, if a changes, so will g(a), and thus so will g(a)^const. (The fact that g(a) changes if a does is by the first property of g; if it didn't then g(x) wouldn't be reversible.) The same holds for b and c. For d, it's even easier because then f(a,b,c,d) = (d^const, d^const, d^const, d^const) so if d changes, every byte changes.
Finally, we construct such a function g. Let T be the space of two-bit values, and h : T -> T the function such that h(0) = 0, h(1) = 2, h(2) = 3, and h(3) = 1. This function has the two desired properties of g, namely h(x) is reversible and so is x^h(x). (For the latter, check that 0^h(0) = 0, 1^h(1) = 3, 2^h(2) = 1, and 3^h(3) = 2.) So, finally, to compute g(x), split x into four groups of two bits, and take h of each quarter separately. Because h satisfies the two desired properties, and there's no interaction between the quarters, so does g.
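A minimal C sketch of this construction, just to make it concrete (the names g, f, f_inv and the lookup-table approach to inverting x^g(x) are my own illustration, not part of the answer above):

#include <stdint.h>

// h on 2-bit values: h(0)=0, h(1)=2, h(2)=3, h(3)=1
static const uint8_t H[4] = {0, 2, 3, 1};

// g applies h to each 2-bit group of a byte independently
static uint8_t g(uint8_t x) {
    return (uint8_t)( H[x & 3]
                    | (H[(x >> 2) & 3] << 2)
                    | (H[(x >> 4) & 3] << 4)
                    | (H[(x >> 6) & 3] << 6) );
}

// f(a,b,c,d) = (a^b^c^d, g(a)^b^c^d, a^g(b)^c^d, a^b^g(c)^d)
static void f(const uint8_t in[4], uint8_t out[4]) {
    uint8_t a = in[0], b = in[1], c = in[2], d = in[3];
    out[0] = a ^ b ^ c ^ d;
    out[1] = g(a) ^ b ^ c ^ d;
    out[2] = a ^ g(b) ^ c ^ d;
    out[3] = a ^ b ^ g(c) ^ d;
}

// Inverse: x -> x ^ g(x) is a bijection, so invert it with a 256-entry table.
static void f_inv(const uint8_t out[4], uint8_t in[4]) {
    uint8_t inv_xg[256];
    for (int x = 0; x < 256; x++)
        inv_xg[(uint8_t)(x ^ g((uint8_t)x))] = (uint8_t)x;
    uint8_t a = inv_xg[out[0] ^ out[1]];    // out[0]^out[1] == a ^ g(a)
    uint8_t b = inv_xg[out[0] ^ out[2]];    // out[0]^out[2] == b ^ g(b)
    uint8_t c = inv_xg[out[0] ^ out[3]];    // out[0]^out[3] == c ^ g(c)
    in[0] = a;
    in[1] = b;
    in[2] = c;
    in[3] = (uint8_t)(out[0] ^ a ^ b ^ c);  // recover d from the first output byte
}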
I'm not sure I understand your question, but I think I get what you're trying to do.
Bitwise Exclusive Or is your friend.
If R = A XOR B, R XOR A gives B and R XOR B gives A back. So it's a reversible transformation, assuming you know the result and one of the inputs.
Assuming I understood what you're trying to do, I think any block cipher will do the job.
A block cipher takes a block of bits (say 128) and maps them reversibly to a different block with the same size.
Moreover, if you're using OFB mode you can use a block cipher to generate an infinite stream of pseudo-random bits. XORing these bits with your stream of bits will give you a transformation for any length of data.
I'm going to throw out an idea that may or may not work.
Use a set of linear functions mod 256, with odd prime coefficients.
For example:
b0 = 3 * a0 + 5 * a1 + 7 * a2 + 11 * a3;
b1 = 13 * a0 + 17 * a1 + 19 * a2 + 23 * a3;
If I remember the Chinese Remainder Theorem correctly, and I haven't looked at it in years, the ax are recoverable from the bx. There may even be a quick way to do it.
This is, I believe, a reversible transformation. It's linear, in that a*f(x) mod 256 = f(a*x) and (f(x) + f(y)) mod 256 = f(x + y). Clearly, changing one input byte will change all the output bytes.
So, go look up the Chinese Remainder Theorem and see if this works.
What do you mean by "linear" transformation?
O(n), or a function f with f(c * (a+b)) = c * f(a) + c * f(b)?
An easy approach would be a rotating bit shift (not sure if this fulfils the math definition above). It's reversible and every byte can change, but it does not enforce that every byte is changed.
EDIT: My solution would be this:
b0 = (a0 ^ a1 ^ a2 ^ a3)
b1 = a1 + b0 ( mod 256)
b2 = a2 + b0 ( mod 256)
b3 = a3 + b0 ( mod 256)
It would be reversible (just subtract the first byte from the other, and then XOR the 3 resulting bytes on the first), and a change in one bit would change every byte (as b0 is the result of all bytes and impacts all others).
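Here's a minimal C sketch of that scheme and its inverse, just for illustration (byte arithmetic wraps mod 256; the function names are mine):

#include <stdint.h>

static void fwd(const uint8_t a[4], uint8_t b[4]) {
    b[0] = a[0] ^ a[1] ^ a[2] ^ a[3];
    b[1] = (uint8_t)(a[1] + b[0]);   // mod 256 via uint8_t wraparound
    b[2] = (uint8_t)(a[2] + b[0]);
    b[3] = (uint8_t)(a[3] + b[0]);
}

static void inv(const uint8_t b[4], uint8_t a[4]) {
    a[1] = (uint8_t)(b[1] - b[0]);       // subtract the first byte from the others
    a[2] = (uint8_t)(b[2] - b[0]);
    a[3] = (uint8_t)(b[3] - b[0]);
    a[0] = b[0] ^ a[1] ^ a[2] ^ a[3];    // then XOR the three results back onto the first
}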
Stick all of the bytes into a 32-bit number and then do a shl or shr (shift left or shift right) by one, two or three bits. Then split it back into bytes (you could use a variant record). This will move bits from each byte into an adjacent byte.
There are a number of good suggestions here (XOR, etc.) I would suggest combining them.
You could remap the bits. Let's use ii for input and oo for output:
oo[0] = (ii[0] & 0xC0) | (ii[1] & 0x30) | (ii[2] & 0x0C) | (ii[3] & 0x03)
oo[1] = (ii[0] & 0x30) | (ii[1] & 0x0C) | (ii[2] & 0x03) | (ii[3] & 0xC0)
oo[2] = (ii[0] & 0x0C) | (ii[1] & 0x03) | (ii[2] & 0xC0) | (ii[3] & 0x30)
oo[3] = (ii[0] & 0x03) | (ii[1] & 0xC0) | (ii[2] & 0x30) | (ii[3] & 0x0C)
It's not linear, but significantly changing one byte of the input will affect all the bytes of the output. I don't think you can have a reversible transformation such that changing one bit of the input will affect all four bytes of the output, but I don't have a proof.
