How could I test if two bit patterns differs in any N bits (position doesn't matter) - comparison

Let's say i have this bit field value: 10101001
How would i test if any other value differs in any n bits. Without considering
the positions?
Example:
10101001
10101011 --> 1 bit different
10101001
10111001 --> 1 bit different
10101001
01101001 --> 2 bits different
10101001
00101011 --> 2 bits different
I need to make a lot of this comparisons so i'm primarily looking for perfomance but any
hint is very welcome.

Take the XOR of the two fields and do a population count of the result.

if you XOR the 2 values together, you are left only with the bits that are different.
You then only need to count the bits which are still 1 and you have your answer
in c:
unsigned char val1=12;
unsigned char val2=123;
unsigned char xored = val1 ^ val2;
int i;
int numBits=0;
for(i=0; i<8; i++)
{
if(xored&1) numBits++;
xored>>=1;
}
although there are probably faster ways to count the bits in a byte
(you could for instance use a lookuptable for 256 values)

Just like everybody else said, use XOR to determine what's different and then use one of these algorithms to count.

This gets the bit difference between the values and counts the bits three at a time:
public static int BitDifference(int a, int b) {
int cnt = 0, bits = a ^ b;
while (bits != 0) {
cnt += (0xE994 >> ((bits & 7) << 1)) & 3;
bits >>= 3;
}
return cnt;
}

XOR the numbers, then the problem becomes a matter of counting the 1s in the result.

In Java:
Integer.bitCount(a ^ b)

Comparison is performed with XOR, as others already answered.
counting can be performed in several ways:
shift left and addition.
lookup in a table.
logic formulas that you can find with Karnaugh maps.

Related

How would one create a bitwise rotation function in dart?

I'm in the process of creating a cryptography package for Dart (https://pub.dev/packages/steel_crypt). Right now, most of what I've done is either exposed from PointyCastle or simple-ish algorithms where bitwise rotations are unnecessary or replaceable by >> and <<.
However, as I move toward complicated cryptography solutions, which I can do mathematically, I'm unsure of how to implement bitwise rotation in Dart with maximum efficiency. Because of the nature of cryptography, the speed part is emphasized and uncompromising, in that I need the absolute fastest implementation.
I've ported a method of bitwise rotation from Java. I'm pretty sure this is correct, but unsure of the efficiency and readability:
My tested implementation is below:
int INT_BITS = 64; //Dart ints are 64 bit
static int leftRotate(int n, int d) {
//In n<<d, last d bits are 0.
//To put first 3 bits of n at
//last, do bitwise-or of n<<d with
//n >> (INT_BITS - d)
return (n << d) | (n >> (INT_BITS - d));
}
static int rightRotate(int n, int d) {
//In n>>d, first d bits are 0.
//To put last 3 bits of n at
//first, we do bitwise-or of n>>d with
//n << (INT_BITS - d)
return (n >> d) | (n << (INT_BITS - d));
}
EDIT (for clarity): Dart has no unsigned right or left shift, meaning that >> and << are signed right shifts, which bears more significance than I might have thought. It poses a challenge that other languages don't in terms of devising an answer. The accepted answer below explains this and also shows the correct method of bitwise rotation.
As pointed out, Dart has no >>> (unsigned right shift) operator, so you have to rely on the signed shift operator.
In that case,
int rotateLeft(int n, int count) {
const bitCount = 64; // make it 32 for JavaScript compilation.
assert(count >= 0 && count < bitCount);
if (count == 0) return n;
return (n << count) |
((n >= 0) ? n >> (bitCount - count) : ~(~n >> (bitCount - count)));
}
should work.
This code only works for the native VM. When compiling to JavaScript, numbers are doubles, and bitwise operations are only done on 32-bit numbers.

SSE/AVX: Choose from two __m256 float vectors based on per-element min and max absolute value

I am looking for efficient AVX (AVX512) implementation of
// Given
float u[8];
float v[8];
// Compute
float a[8];
float b[8];
// Such that
for ( int i = 0; i < 8; ++i )
{
a[i] = fabs(u[i]) >= fabs(v[i]) ? u[i] : v[i];
b[i] = fabs(u[i]) < fabs(v[i]) ? u[i] : v[i];
}
I.e., I need to select element-wise into a from u and v based on mask, and into b based on !mask, where mask = (fabs(u) >= fabs(v)) element-wise.
I had this exact same problem just the other day. The solution I came up with (using AVX only) was:
// take the absolute value of u and v
__m256 sign_bit = _mm256_set1_ps(-0.0f);
__m256 u_abs = _mm256_andnot_ps(sign_bit, u);
__m256 v_abs = _mm256_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__m256 u_ge_v = _mm256_cmp_ps(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b, flipping the argument
// order for b to invert the sense of the mask
__m256 a = _mm256_blendv_ps(u, v, u_ge_v);
__m256 b = _mm256_blendv_ps(v, u, u_ge_v);
The AVX512 equivalent would be:
// take the absolute value of u and v
__m512 sign_bit = _mm512_set1_ps(-0.0f);
__m512 u_abs = _mm512_andnot_ps(sign_bit, u);
__m512 v_abs = _mm512_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__mmask16 u_ge_v = _mm512_cmp_ps_mask(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b, flipping the argument
// order for b to invert the sense of the mask
__m512 a = _mm512_mask_blend_ps(u_ge_v, u, v);
__m512 b = _mm512_mask_blend_ps(u_ge_v, v, u);
As Peter Cordes suggested in the comments above, there are other approaches as well like taking the absolute value followed by a min/max and then reinserting the sign bit, but I couldn't find anything that was shorter/lower latency than this sequence of instructions.
Actually, there is another approach using AVX512DQ's VRANGEPS via the _mm512_range_ps() intrinsic. Intel's intrinsic guide describes it as follows:
Calculate the max, min, absolute max, or absolute min (depending on control in imm8) for packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst. imm8[1:0] specifies the operation control: 00 = min, 01 = max, 10 = absolute max, 11 = absolute min. imm8[3:2] specifies the sign control: 00 = sign from a, 01 = sign from compare result, 10 = clear sign bit, 11 = set sign bit.
Note that there appears to be a typo in the above; actually imm8[3:2] == 10 is "absolute min" and imm8[3:2] == 11 is "absolute max" if you look at the details of the per-element operation:
CASE opCtl[1:0] OF
0: tmp[31:0] := (src1[31:0] <= src2[31:0]) ? src1[31:0] : src2[31:0]
1: tmp[31:0] := (src1[31:0] <= src2[31:0]) ? src2[31:0] : src1[31:0]
2: tmp[31:0] := (ABS(src1[31:0]) <= ABS(src2[31:0])) ? src1[31:0] : src2[31:0]
3: tmp[31:0] := (ABS(src1[31:0]) <= ABS(src2[31:0])) ? src2[31:0] : src1[31:0]
ESAC
CASE signSelCtl[1:0] OF
0: dst[31:0] := (src1[31] << 31) OR (tmp[30:0])
1: dst[31:0] := tmp[63:0]
2: dst[31:0] := (0 << 31) OR (tmp[30:0])
3: dst[31:0] := (1 << 31) OR (tmp[30:0])
ESAC
RETURN dst
So you can get the same result with just two instructions:
auto a = _mm512_range_ps(v, u, 0x7); // 0b0111 = sign from compare result, absolute max
auto b = _mm512_range_ps(v, u, 0x6); // 0b0110 = sign from compare result, absolute min
The argument order (v, u) is a bit unintuitive, but it's needed in order to get the same behavior that you described in the OP in the event that the elements have equal absolute value (namely, that the value from u is passed through to a, and v goes to b).
On Skylake and Ice Lake Xeon platforms (probably any of the Xeons that have dual FMA units, probably?), VRANGEPS has throughput 2, so the two checks can issue and execute simultaneously, with latency of 4 cycles. This is only a modest latency improvement on the original approach, but the throughput is better and it requires fewer instructions/uops/instruction cache space.
clang does a pretty reasonable job of auto-vectorizing it with -ffast-math and the necessary __restrict qualifiers: https://godbolt.org/z/NMvN1u. and both inputs to ABS them, compare once, vblendvps twice on the original inputs with the same mask but the other sources in the opposite order to get min and max.
That's pretty much what I was thinking before checking what compilers did, and looking at their output to firm up the details I hadn't thought through yet. I don't see anything more clever than that. I don't think we can avoid abs()ing both a and b separately; there's no cmpps compare predicate that compares magnitudes and ignores the sign bit.
// untested: I *might* have reversed min/max, but I think this is right.
#include <immintrin.h>
// returns min_abs
__m256 minmax_abs(__m256 u, __m256 v, __m256 *max_result) {
const __m256 signbits = _mm256_set1_ps(-0.0f);
__m256 abs_u = _mm256_andnot_ps(signbits, u);
__m256 abs_v = _mm256_andnot_ps(signbits, v); // strip the sign bit
__m256 maxabs_is_v = _mm256_cmp_ps(abs_u, abs_v, _CMP_LT_OS); // u < v
*max_result = _mm256_blendv_ps(v, u, maxabs_is_v);
return _mm256_blendv_ps(u, v, maxabs_is_v);
}
You'd do the same thing with AVX512 except you compare into a mask instead of another vector.
// returns min_abs
__m512 minmax_abs512(__m512 u, __m512 v, __m512 *max_result) {
const __m512 absmask = _mm512_castsi512_ps(_mm512_set1_epi32(0x7fffffff));
__m512 abs_u = _mm512_and_ps(absmask, u);
__m512 abs_v = _mm512_and_ps(absmask, v); // strip the sign bit
__mmask16 maxabs_is_v = _mm512_cmp_ps_mask(abs_u, abs_v, _CMP_LT_OS); // u < v
*max_result = _mm512_mask_blend_ps(maxabs_is_v, v, u);
return _mm512_mask_blend_ps(maxabs_is_v, u, v);
}
Clang compiles the return statement in an interesting way (Godbolt):
.LCPI2_0:
.long 2147483647 # 0x7fffffff
minmax_abs512(float __vector(16), float __vector(16), float __vector(16)*): # #minmax_abs512(float __vector(16), float __vector(16), float __vector(16)*)
vbroadcastss zmm2, dword ptr [rip + .LCPI2_0]
vandps zmm3, zmm0, zmm2
vandps zmm2, zmm1, zmm2
vcmpltps k1, zmm3, zmm2
vblendmps zmm2 {k1}, zmm1, zmm0
vmovaps zmmword ptr [rdi], zmm2 ## store the blend result
vmovaps zmm0 {k1}, zmm1 ## interesting choice: blend merge-masking
ret
Instead of using another vblendmps, clang notices that zmm0 already has one of the blend inputs, and uses merge-masking with a regular vector vmovaps. This has zero advantage of Skylake-AVX512 for 512-bit vblendmps (both single-uop instructions for port 0 or 5), but if Agner Fog's instruction tables are right, vblendmps x/y/zmm only ever runs on port 0 or 5, but a masked 256-bit or 128-bit vmovaps x/ymm{k}, x/ymm can run on any of p0/p1/p5.
Both are single-uop / single-cycle latency, unlike AVX2 vblendvps based on a mask vector which is 2 uops. (So AVX512 is an advantage even for 256-bit vectors). Unfortunately, none of gcc, clang, or ICC turn the _mm256_cmp_ps into _mm256_cmp_ps_mask and optimize the AVX2 intrinsics to AVX512 instructions when compiling with -march=skylake-avx512.)
s/512/256/ to make a version of minmax_abs512 that uses AVX512 for 256-bit vectors.
Gcc goes even further, and does the questionable "optimization" of
vmovaps zmm2, zmm1 # tmp118, v
vmovaps zmm2{k1}, zmm0 # tmp118, tmp114, tmp118, u
instead of using one blend instruction. (I keep thinking I'm seeing a store followed by a masked store, but no, neither compiler is blending that way).

arc4random() gives negative Number in iOS

Sometime arc4random() gives negative number also in objective C.
My code is as follow:
Try 1:
long ii = arc4random();
Try 2:
int i = arc4random();
How can I only get positivite random number?
Thank you,
No, it's always positive as it returns an unsigned 32-bit integer (manpage):
u_int32_t arc4random(void);
You are treating it as a signed integer, which is incorrect.
You should use the arc4random_uniform() function. this is the most common random function used.
arc4random_uniform() function
Returns a random number between 0 and the inserted parameter minus 1.
For example arc4random_uniform(3) may return 0, 1 or 2 but not 3.
Example
u_int32_t randomPositiveNo = arc4random_uniform(5) + 1; //to get the range 1 - 5

Integer overflow in Fibonacci number

I was solving this codechef problem on Fibonacci numbers. It says number is of 1000 digits then why it is not causing integer overflow in tester's solution when it is scanning the array and storing it in unsigned long long int. I can't understand how solution is working. Below is the problem and tester's solution.
The Head Chef has been playing with Fibonacci numbers for long . He has learnt several tricks related to Fibonacci numbers . Now he wants to test his chefs in the skills .
A fibonacci number is defined by the recurrence :
f(n) = f(n-1) + f(n-2) for n > 2
and f(1) = 0
and f(2) = 1 .
Given a number A , determine if it is a fibonacci number.
Input
The first line of the input contains an integer T denoting the number of test cases. The description of T test cases follows.
The only line of each test case contains a single integer A denoting the number to be checked .
Output
For each test case, output a single line containing "YES" if the given number is a fibonacci number , otherwise output a single line containing "NO" .
Constraints
1 ≤ T ≤ 1000
1 ≤ number of digits in A ≤ 1000
The sum of number of digits in A in all test cases <= 10000.
Example
Input:
3
3
4
5
Output:
YES
NO
YES
**Tester's solution:**
#include <iostream>
#include <cstdio>
#include <algorithm>
#include <set>
#include <cstring>
using namespace std;
int const mx = 6666;
set <unsigned long long> f;
unsigned long long fib[mx + 10];
char s[mx + 1];
int main(){
// freopen("input.txt", "r", stdin);
// freopen("output.txt", "w", stdout);
fib[0] = 0;
fib[1] = 1;
f.insert(1);
f.insert(0);
int i;
for (i = 2; i <= mx; i++){
fib[i] = fib[i - 1] + fib[i - 2];
f.insert(fib[i]);
}
int tc;
cin>>tc;
while (tc--){
unsigned long long n = 0, ten = 10;
cin>>s;
int len = strlen(s);
for (i = 0; i < len; i++){
char q = s[i];
unsigned long long a = q - '0';
n = n * ten + a;
}
if (f.find(n) == f.end()) printf("NO\n");
else printf("YES\n");
}
return 0;
}
From cplusplus you will see that,
ULLONG_MAX Maximum value for an object of type unsigned long long int is 18446744073709551615 (264-1) or greater.
The actual value depends on the particular system and library implementation, but shall reflect the limits of these types in
the target platform.
Above information is just to let you know its a BIG number. Moreover the cause of not getting overflow is not the limit i mentioned.
Most probably, the input file of judge does not contain any input that can cause an overflow.
And its still possible to set such input even after fulfilling the conditions,
1 ≤ T ≤ 1000
1 ≤ number of digits in A ≤ 1000
The sum of number of digits in A in all test cases <= 10000.

Why is arc4random() behaving different when storing it in a variable or not?

int chance = -5;
int rand = arc4random() % 100; // Number from 0 to 99
if (rand <= chance) { // This will never happen
NSLog(#"This is... NOT POSSIBLE");
}
Effectively, this never happens. But
int chance = -5;
if (arc4random() % 100 <= chance) {
NSLog(#"This is... NOT POSSIBLE");
}
Here, instead of storing it in a variable, I placed the random number expression directly in the condition. And the condition is fulfilled (sometimes).
Why is that? How can I debug this behavior?
Type promotion rules.
arc4random returns an unsigned value. That means in your second case, the -5 gets promoted to that same unsigned type, turning it into 4294967291. 4+ billion is definitely larger than any number 0-99!
Let's walk through what happens in both of your examples.
From your first example, in this line:
int rand = arc4random() % 100;
arc4random() returns an unsigned value. So then it looks like:
int rand = someUnsignedNumber % 100;
The 100 is a signed int, so it gets promoted to the same type as someUnsignedNumber, and the % operation is applied. After that you have:
int rand = someUnsignedNumberBetween0And99;
Assigning that unsigned number to int rand makes it back into a signed number. Your comparison then goes forward as expected.
In the second example, you have this line:
if (arc4random() % 100 <= chance)
The same things happen with arc4random() % 100, yielding something like:
if (someUnsignedNumberBetween0And99 <= chance)
But here, chance is a signed number. It gets promoted, changing its value as described above, and you end up with the strange behaviour you're seeing.
Silly, silly type system of C... If you read the man page for arc4random(), you find out that its prototype is
u_int32_t arc4random(void);
So it returns an unsigned integer.
When comparing its - unsigned - result with another integer, the unsignedness "wins": the other value (-5) is promoted to the unsigned type (u_int32_t in this case), it rolls over (since unsigned integer "underflow" is designed to work like this in C - you'll get 2 ^ 32 - 5) and so an "erroneous" (i. e. behaving-as-unexpected) comparison occurs.
When you explicitly assign the value to an int (i. e. signed) variable, this promotion does not occur since the comparison is between two signed types, so it is evaluated as you would expect.

Resources