How to get the bytes occupied by the whole value?

use std::mem;

impl Solution {
    pub fn find_substring(s: String, words: Vec<String>) -> Vec<i32> {
        let mut result: Vec<i32> = vec![23, 234, 243, 23, 26, 24, 2345];
        println!("{}", mem::size_of_val(&result));
        result
    }
}
But I'm getting 24 for the println!. I'm not sure what 24 bytes refers to. I want to know how to get the total memory the result vector is consuming, including the bytes that store the values and also any additional bytes required for the data structure itself. How can I find that?

The 24 bytes (on a 64-bit platform) are just the size of the Vec struct's three fields: a pointer, a length, and a capacity.
There's no general way to follow all internal pointers and determine the "real" space used in memory. There's not even an obvious general definition of such space (what do you do if a field is an Rc?).
What is possible is to define a function that computes the size used by a vector: the size of the struct itself plus its capacity multiplied by the size of the contained element type:
pub fn size_of_vec<T>(vec: &Vec<T>) -> usize {
    std::mem::size_of_val(vec) + vec.capacity() * std::mem::size_of::<T>()
}

fn main() {
    let result: Vec<i32> = vec![23, 234, 243, 23, 26, 24, 2345];
    dbg!(size_of_vec(&result));
}
Of course, if the T type itself owns heap-allocated data, this size_of_vec function can't account for it.

Related

Extract 4 bits vs 2 bits of Bluetooth HEX data: why does the same method result in an error?

This is a follow-on question to this SO question (Extract 4 bits of Bluetooth HEX Data), which already has an accepted answer. I want to understand why the method I was using (example below), which works, does not work when applied to that question.
To decode Cycling Power data, the first 2 bytes are the flags field, which is used to determine what capabilities the power meter provides.
guard let characteristicData = characteristic.value else { return -1 }
var byteArray = [UInt8](characteristicData)

// This is the output from the sensor (in decimal and hex)
// DEC [35, 0, 25, 0, 96, 44, 0, 33, 229] Hex:{length = 9, bytes = 0x23001900602c0021e5} FirstByte:100011

/// First 2 bytes are the flags
let flags = byteArray[1]<<8 + byteArray[0]
This concatenates the first 2 bytes into the flags value. After that, I use the flags value and mask it to get the relevant bit position.
E.g. to get power balance, I do (flags & 0x01) > 0.
This method works and I'm a happy camper.
However, why is it that when I use this same method on SO Extract 4 bits of Bluetooth HEX Data it does not work? That one is decoding Bluetooth FTMS data (different from above):
guard let characteristicData = characteristic.value else { return -1 }
let byteArray = [UInt8](characteristicData)
let nsdataStr = NSData.init(data: (characteristic.value)!)
print("pwrFTMS 2ACC Feature Array:[\(byteArray.count)]\(byteArray) Hex:\(nsdataStr)")
PwrFTMS 2ACC Feature Array:[8][2, 64, 0, 0, 8, 32, 0, 0] Hex:{length = 8, bytes = 0x0240000008200000}
Based on the specs, the returned data has 2 characteristics, each of them 4 octets long.
Doing
byteArray[3]<<24 + byteArray[2]<<16 + byteArray[1]<<8 + byteArray[0]
to join the first 4 bytes results in a wrong output to start the decoding.
edit: Added clarification
There is a problem with this code that you say works... but it seems to work "accidentally":
let flags = byteArray[1]<<8 + byteArray[0]
This results in a UInt8, but the flags field in the first table is 16 bits. Note that byteArray[1] << 8 always evaluates to 0, because you are shifting all of the bits of the byte out of the byte. It appeared to work because the only bit you were interested in was in byteArray[0].
So you need to convert it to a 16-bit (or larger) value first and then shift it:
let flags = (UInt16(byteArray[1]) << 8) + UInt16(byteArray[0])
Now flags is a UInt16.
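To make the width issue concrete, here is a tiny self-contained check (the two sample bytes are hypothetical, not real sensor data, and not part of the original answer):

// Hypothetical sample bytes, not from the Cycling Power spec
let byteArray: [UInt8] = [0x23, 0x01]

let flags8 = byteArray[1] << 8 + byteArray[0]                      // stays UInt8, upper byte lost
let flags16 = (UInt16(byteArray[1]) << 8) + UInt16(byteArray[0])   // full 16-bit flags value

print(flags8)                   // 35  (only byteArray[0] survives)
print(flags16)                  // 291 (0x0123)
print((flags16 & 0x0100) > 0)   // true: a flag bit stored in the second byte is now visible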
Similarly, when you combine 4 bytes, you need them to be 32-bit values before you shift. So:
let flags = UInt32(byteArray[3]) << 24
          + UInt32(byteArray[2]) << 16
          + UInt32(byteArray[1]) << 8
          + UInt32(byteArray[0])
But since that's just reading a 32-bit value from a sequence of bytes that are in little-endian byte order, and all current Apple devices (and the vast majority of all other modern computers) are little-endian machines, here is an easier way:
let flags = byteArray.withUnsafeBytes {
    $0.bindMemory(to: UInt32.self)[0]
}
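If you prefer to keep the byte order explicit rather than rely on the host being little-endian, a variant of the same read looks like this (a sketch, not part of the original answer; loadUnaligned requires Swift 5.7 or later):

// Sketch: read the first 4 bytes as a UInt32 and convert from little-endian explicitly.
// Assumes byteArray holds at least 4 bytes.
let flags = UInt32(littleEndian: byteArray.withUnsafeBytes {
    $0.loadUnaligned(as: UInt32.self)
})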
In summary, in both cases, you had been only preserving byte 0 in your shift-add, because the other shifts all evaluated to 0 due to shifting the bits completely out of the byte. It just so happened that in the first case byte[0] contained the information you needed. In general, it's necessary to first promote the value to the size you need for the result, and then shift it.

How to convert Swift [Data] to char **

I'm currently trying to port my Java Android library to Swift. In my Android library I'm using a JNI wrapper for Jerasure to call the following C method:
int jerasure_matrix_decode(int k, int m, int w, int *matrix, int row_k_ones, int *erasures, char **data_ptrs, char **coding_ptrs, int size)
I have to admit that I'm relatively new to Swift, so some of my stuff might be wrong. In my Java code, char **data_ptrs and char **coding_ptrs are actually two-dimensional arrays (e.g. byte[][] dataShard = new byte[3][1400]). These two-dimensional arrays contain actual video stream data. In my Swift library I store my video stream data in a [Data] array, so the question is: what is the correct way to convert the [Data] array to the C char ** type?
I already tried some things but none of them worked. Currently I have the following conversion logic, which gives me an UnsafeMutablePointer<UnsafeMutablePointer<Int8>?>? pointer (data = [Data]):
let ptr1 = ptrFromAddress(p: &data)
ptr1.withMemoryRebound(to: UnsafeMutablePointer<Int8>?.self, capacity: data.count) { pp in
    // here pp is UnsafeMutablePointer<UnsafeMutablePointer<Int8>?>?
}

func ptrFromAddress<T>(p: UnsafeMutablePointer<T>) -> UnsafeMutablePointer<T> {
    return p
}
The expected result would be that jerasure is able to restore missing data shards of my [Data] array when calling the jerasure_matrix_decode method but instead it completely messes up my [Data] array and accessing it results in EXC_BAD_ACCESS. So I expect this is completely the wrong way.
The documentation in the jerasure.h header file says the following about data_ptrs:
data_ptrs = An array of k pointers to data which is size bytes
Edit:
The jerasure library defines data_ptrs like this:
#define talloc(type, num) (type *) malloc(sizeof(type)*(num))

char **data;
data = talloc(char *, k);
for (i = 0; i < k; i++) {
    data[i] = talloc(char, sizeof(long)*w);
}
So what is the best option to call the jerasure_matrix_decode method from Swift? Should I use something different than [Data]?
Possible similar question:
How to create a UnsafeMutablePointer<UnsafeMutablePointer<UnsafeMutablePointer<Int8>>>
A possible solution could be to allocate appropriate memory and fill it with the data.
Alignment
The equivalent to char ** of the C code would be UnsafeMutablePointer<UnsafeMutablePointer<CChar>?> on Swift side.
In the definition of data_ptrs that you show in your question, we see that each data block is to be allocated with malloc.
A property of C malloc is that it does not know which pointer type it will eventually be cast into. Therefore, it guarantees strictest memory alignment:
The pointer returned if the allocation succeeds is suitably aligned so that it may be assigned to a pointer to any type of object with a fundamental alignment requirement and then used to access such an object or an array of such objects in the space allocated (until the space is explicitly deallocated).
see https://port70.net/~nsz/c/c11/n1570.html#7.22.3
Particularly performance-critical C routines often do not operate byte by byte, but cast to larger numeric types or use SIMD.
So, depending on your internal C library implementation, allocating with UnsafeMutablePointer<CChar>.allocate(capacity: columns) could be problematic, because
UnsafeMutablePointer provides no automated memory management or alignment guarantees.
see https://developer.apple.com/documentation/swift/unsafemutablepointer
The alternative could be to use UnsafeMutableRawPointer with an alignment parameter. You can use MemoryLayout<max_align_t>.alignment to find out the maximum alignment constraint.
Populating Data
An UnsafeMutablePointer<CChar> would have the advantage that we could use pointer arithmetic. This can be achieved by converting the UnsafeMutableRawPointer to an OpaquePointer and then to an UnsafeMutablePointer. In the code it would then look like this:
let colDataRaw = UnsafeMutableRawPointer.allocate(byteCount: cols, alignment: MemoryLayout<max_align_t>.alignment)
let colData = UnsafeMutablePointer<CChar>(OpaquePointer(colDataRaw))
for x in 0..<cols {
    colData[x] = CChar(bitPattern: dataArray[y][x])
}
Complete Self-contained Test Program
Your library will probably have certain requirements for the data (e.g. supported matrix dimensions), which I don't know. These must be taken into account, of course. But for a basic technical test we can create an independent test program.
#include <stdio.h>
#include "matrix.h"

void some_matrix_operation(int rows, int cols, char **data_ptrs) {
    printf("C side:\n");
    for (int y = 0; y < rows; y++) {
        for (int x = 0; x < cols; x++) {
            printf("%02d ", (unsigned char)data_ptrs[y][x]);
            data_ptrs[y][x] += 100;
        }
        printf("\n");
    }
    printf("\n");
}
It simply prints the bytes and adds 100 to each byte to be able to verify that the changes arrive on the Swift side.
The corresponding header must be included in the bridge header and looks like this:
#ifndef matrix_h
#define matrix_h
void some_matrix_operation(int rows, int cols, char **data_ptrs);
#endif /* matrix_h */
On the Swift side, we can put everything in a class called Matrix:
import Foundation

class Matrix: CustomStringConvertible {
    let rows: Int
    let cols: Int
    let dataPtr: UnsafeMutablePointer<UnsafeMutablePointer<CChar>?>

    init(dataArray: [Data]) {
        guard !dataArray.isEmpty && !dataArray[0].isEmpty else { fatalError("empty data not supported") }
        self.rows = dataArray.count
        self.cols = dataArray[0].count
        self.dataPtr = Self.copyToCMatrix(rows: rows, cols: cols, dataArray: dataArray)
    }

    deinit {
        for y in 0..<rows {
            dataPtr[y]?.deallocate()
        }
        dataPtr.deallocate()
    }

    var description: String {
        var desc = ""
        for data in dataArray {
            for byte in data {
                desc += "\(byte) "
            }
            desc += "\n"
        }
        return desc
    }

    var dataArray: [Data] {
        var array = [Data]()
        for y in 0..<rows {
            if let ptr = dataPtr[y] {
                array.append(Data(bytes: ptr, count: cols))
            }
        }
        return array
    }

    private static func copyToCMatrix(rows: Int, cols: Int, dataArray: [Data]) -> UnsafeMutablePointer<UnsafeMutablePointer<CChar>?> {
        let dataPtr = UnsafeMutablePointer<UnsafeMutablePointer<CChar>?>.allocate(capacity: rows)
        for y in 0..<rows {
            let colDataRaw = UnsafeMutableRawPointer.allocate(byteCount: cols, alignment: MemoryLayout<max_align_t>.alignment)
            let colData = UnsafeMutablePointer<CChar>(OpaquePointer(colDataRaw))
            dataPtr[y] = colData
            for x in 0..<cols {
                colData[x] = CChar(bitPattern: dataArray[y][x])
            }
        }
        return dataPtr
    }
}
You can call it as shown here:
let example: [[UInt8]] = [
    [126, 127, 128, 129],
    [130, 131, 132, 133],
    [134, 135, 136, 137]
]
let dataArray = example.map { Data($0) }
let matrix = Matrix(dataArray: dataArray)

print("before on Swift side:")
print(matrix)

some_matrix_operation(Int32(matrix.rows), Int32(matrix.cols), matrix.dataPtr)

print("afterwards on Swift side:")
print(matrix)
Test Result
The test result is as follows and seems to show the expected result.
before on Swift side:
126 127 128 129
130 131 132 133
134 135 136 137
C side:
126 127 128 129
130 131 132 133
134 135 136 137
afterwards on Swift side:
226 227 228 229
230 231 232 233
234 235 236 237

Positive NSDecimalNumber returns unexpected 64-bit integer values

I've stumbled onto an odd NSDecimalNumber behavior: for some values, invocations of integerValue, longValue, longLongValue, etc. return an unexpected value. Example:
let v = NSDecimalNumber(string: "9.821426272392280061")
v // evaluates to 9.821426272392278
v.intValue // evaluates to 9
v.integerValue // evaluates to -8
v.longValue // evaluates to -8
v.longLongValue // evaluates to -8
let v2 = NSDecimalNumber(string: "9.821426272392280060")
v2 // evaluates to 9.821426272392278
v2.intValue // evaluates to 9
v2.integerValue // evaluates to 9
v2.longValue // evaluates to 9
v2.longLongValue // evaluates to 9
This is using Xcode 7.3; I haven't tested with earlier versions of the frameworks.
I've seen a bunch of discussion about unexpected rounding behavior with NSDecimalNumber, as well as admonishments not to initialize it with the inherited NSNumber initializers, but I haven't seen anything about this specific behavior. Nevertheless there are some rather detailed discussions about internal representations and rounding which may contain the nugget I seek, so apologies in advance if I missed it.
EDIT: It's buried in the comments, but I've filed this as issue #25465729 with Apple. OpenRadar: http://www.openradar.me/radar?id=5007005597040640.
EDIT 2: Apple has marked this as a dup of #19812966.
Since you already know the problem is due to "too high precision", you could work around it by rounding the decimal number first:
let b = NSDecimalNumber(string: "9.999999999999999999")
print(b, "->", b.int64Value)
// 9.999999999999999999 -> -8

let truncateBehavior = NSDecimalNumberHandler(roundingMode: .down,
                                              scale: 0,
                                              raiseOnExactness: true,
                                              raiseOnOverflow: true,
                                              raiseOnUnderflow: true,
                                              raiseOnDivideByZero: true)
let c = b.rounding(accordingToBehavior: truncateBehavior)
print(c, "->", c.int64Value)
// 9 -> 9
If you want to use int64Value (i.e. -longLongValue), avoid using numbers with more than 62 bits of precision, i.e. avoid having more than 18 digits in total. Reasons explained below.
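As a quick sanity check of that rule of thumb (my addition here, not part of the original answer), a value with 18 significant digits still converts correctly:

// 18 digits in total: the mantissa 999999999999999999 fits in 62 bits,
// so int64Value is not affected by the overflow described below.
let safe = NSDecimalNumber(string: "9.99999999999999999")
print(safe, "->", safe.int64Value)
// 9.99999999999999999 -> 9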
NSDecimalNumber is internally represented as a Decimal structure:
typedef struct {
    signed   int _exponent:8;
    unsigned int _length:4;
    unsigned int _isNegative:1;
    unsigned int _isCompact:1;
    unsigned int _reserved:18;
    unsigned short _mantissa[NSDecimalMaxSize];  // NSDecimalMaxSize = 8
} NSDecimal;
This can be obtained using .decimalValue, e.g.
let v2 = NSDecimalNumber(string: "9.821426272392280061")
let d = v2.decimalValue
print(d._exponent, d._mantissa, d._length)
// -18 (30717, 39329, 46888, 34892, 0, 0, 0, 0) 4
This means 9.821426272392280061 is internally stored as 9821426272392280061 × 10⁻¹⁸. Note that 9821426272392280061 = 34892 × 65536³ + 46888 × 65536² + 39329 × 65536 + 30717.
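That base-65536 arithmetic can be checked directly (a small verification sketch, not part of the original answer):

// _mantissa stores base-65536 digits, least significant first.
let limbs: [UInt64] = [30717, 39329, 46888, 34892]
var mantissa: UInt64 = 0
for (i, limb) in limbs.enumerated() {
    mantissa += limb << (16 * i)
}
print(mantissa)   // 9821426272392280061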
Now compare with 9.821426272392280060:
let v2 = NSDecimalNumber(string: "9.821426272392280060")
let d = v2.decimalValue
print(d._exponent, d._mantissa, d._length)
// -17 (62054, 3932, 17796, 3489, 0, 0, 0, 0) 4
Note that the exponent is reduced to -17, meaning the trailing zero is omitted by Foundation.
Knowing the internal structure, I now make a claim: the bug is because 34892 ≥ 32768. Observe:
let a = NSDecimalNumber(decimal: Decimal(
    _exponent: -18, _length: 4, _isNegative: 0, _isCompact: 1, _reserved: 0,
    _mantissa: (65535, 65535, 65535, 32767, 0, 0, 0, 0)))
let b = NSDecimalNumber(decimal: Decimal(
    _exponent: -18, _length: 4, _isNegative: 0, _isCompact: 1, _reserved: 0,
    _mantissa: (0, 0, 0, 32768, 0, 0, 0, 0)))

print(a, "->", a.int64Value)
print(b, "->", b.int64Value)
// 9.223372036854775807 -> 9
// 9.223372036854775808 -> -9
Note that 32768 × 65536³ = 2⁶³ is the value just enough to overflow a signed 64-bit number. Therefore, I suspect that the bug is due to Foundation implementing int64Value as (1) convert the mantissa directly into an Int64, and then (2) divide by 10^|exponent|.
In fact, if you disassemble Foundation.framework, you will find that it is basically how int64Value is implemented (this is independent of the platform's pointer width).
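That suspected conversion path can be imitated in a few lines (a sketch of the hypothesis above, not Foundation's actual source):

// Reinterpret the 64-bit mantissa as a signed value (this is where it wraps),
// then divide by 10^|exponent| with exponent = -18.
let mantissa: UInt64 = 9821426272392280061
let wrapped = Int64(bitPattern: mantissa)     // -8625317801317271555
print(wrapped / 1_000_000_000_000_000_000)    // -8, matching the observed result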
But why isn't int32Value affected? Because internally it is just implemented as Int32(self.doubleValue), so no overflow issue occurs. Unfortunately, a double only has 53 bits of precision, so Apple has no choice but to implement int64Value (which requires 64 bits of precision) without floating-point arithmetic.
I'd file a bug with Apple if I were you. The docs say that NSDecimalNumber can represent any value up to 38 digits long. NSDecimalNumber inherits those properties from NSNumber, and the docs don't explicitly say what conversion is involved at that point, but the only reasonable interpretation is that if the number is roundable to and representable as an Int, then you get the correct answer.
It looks to me like a bug in handling the sign-extension during the conversion somewhere, since intValue is 32-bit and integerValue is 64-bit (in Swift).

Hash function to map combinations of 5 to 7 cards

Referring to the original problem: Optimizing hand-evaluation algorithm for Poker-Monte-Carlo-Simulation
I have a list of 5 to 7 cards and want to store their value in a hash table, which should be an array of 32-bit integers directly indexed by the hash function's value.
Given the large number of possible combinations in a 52-card deck, I don't want to waste too much memory.
Numbers:
7-card combinations: 133,784,560
6-card combinations: 20,358,520
5-card combinations: 2,598,960
Total: 156,742,040 possible combinations
Storing 157 million 32-bit integer values costs about 580 MB, so I would like to avoid inflating that by reserving array entries for values that aren't needed.
So the question is: what could a hash function look like that maps each possible, non-duplicated combination of cards to a consecutive value between 0 and 156,742,040, or at least comes close to it?
Paul Senzee has a great post on this for 7 cards (deleted link as it is broken and now points to a NSFW site).
His code is basically a bunch of pre-computed tables and then one function to look up the array index for a given 7-card hand (represented as a 64-bit number with the lowest 52 bits signifying cards):
inline unsigned index52c7(unsigned __int64 x)
{
    const unsigned short *a = (const unsigned short *)&x;
    unsigned A = a[3], B = a[2], C = a[1], D = a[0],
             bcA = _bitcount[A], bcB = _bitcount[B], bcC = _bitcount[C], bcD = _bitcount[D],
             mulA = _choose48x[7 - bcA], mulB = _choose32x[7 - (bcA + bcB)], mulC = _choose16x[bcD];
    return _offsets52c[bcA]                      + _table4[A] * mulA +
           _offsets48c[ (bcA << 4) + bcB]        + _table [B] * mulB +
           _offsets32c[((bcA + bcB) << 4) + bcC] + _table [C] * mulC +
           _table [D];
}
In short, it's a bunch of lookups and bitwise operations powered by pre-computed lookup tables based on perfect hashing.
If you go back and look at this website, you can get the perfect hash code that Senzee used to create the 7-card hash and repeat the process for 5- and 6-card tables (essentially creating a new index52c7.h for each). You might be able to smash all 3 into one table, but I haven't tried that.
All told that should be ~628 MB (4 bytes * 157 M entries). Or, if you want to split it up, you can map it to 16-bit numbers (since I believe most poker hand evaluators only need 7,462 unique hand scores) and then have a separate map from those 7,462 hand scores to whatever hand categories you want. That would be 314 MB.
Here's a different answer based on the colex function concept. It works with bitsets that are sorted in descending order. Here's a Python implementation (both recursive, so you can see the logic, and iterative). The main concept is that, given a bitset, you can always calculate how many bitsets there are with the same number of set bits but less than (in either the lexicographical or mathematical sense) your given bitset. I got the idea from this paper on hand isomorphisms.
from math import factorial


def n_choose_k(n, k):
    return 0 if n < k else factorial(n) // (factorial(k) * factorial(n - k))


def indexset_recursive(bitset, lowest_bit=0):
    """Return number of bitsets with same number of set bits but less than
    given bitset.

    Args:
        bitset (sequence) - Sequence of set bits in descending order.
        lowest_bit (int) - Name of the lowest bit. Default = 0.

    >>> indexset_recursive([51, 50, 49, 48, 47, 46, 45])
    133784559
    >>> indexset_recursive([52, 51, 50, 49, 48, 47, 46], lowest_bit=1)
    133784559
    >>> indexset_recursive([6, 5, 4, 3, 2, 1, 0])
    0
    >>> indexset_recursive([7, 6, 5, 4, 3, 2, 1], lowest_bit=1)
    0
    """
    m = len(bitset)
    first = bitset[0] - lowest_bit
    if m == 1:
        return first
    else:
        t = n_choose_k(first, m)
        return t + indexset_recursive(bitset[1:], lowest_bit)


def indexset(bitset, lowest_bit=0):
    """Return number of bitsets with same number of set bits but less than
    given bitset.

    Args:
        bitset (sequence) - Sequence of set bits in descending order.
        lowest_bit (int) - Name of the lowest bit. Default = 0.

    >>> indexset([51, 50, 49, 48, 47, 46, 45])
    133784559
    >>> indexset([52, 51, 50, 49, 48, 47, 46], lowest_bit=1)
    133784559
    >>> indexset([6, 5, 4, 3, 2, 1, 0])
    0
    >>> indexset([7, 6, 5, 4, 3, 2, 1], lowest_bit=1)
    0
    """
    m = len(bitset)
    g = enumerate(bitset)
    return sum(n_choose_k(bit - lowest_bit, m - i) for i, bit in g)
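For reference, here is the same colexicographic index ported to Swift (a sketch added for comparison, not part of the original answer; it assumes 64-bit Int):

// Number of bitsets with the same number of set bits that come before the
// given bitset in colex order. `bitset` must list its set bits in descending order.
func nChooseK(_ n: Int, _ k: Int) -> Int {
    if n < k { return 0 }
    var result = 1
    for i in 0..<k {
        result = result * (n - i) / (i + 1)   // exact at every step
    }
    return result
}

func indexSet(_ bitset: [Int], lowestBit: Int = 0) -> Int {
    let m = bitset.count
    return bitset.enumerated().reduce(0) { sum, pair in
        sum + nChooseK(pair.element - lowestBit, m - pair.offset)
    }
}

print(indexSet([51, 50, 49, 48, 47, 46, 45]))   // 133784559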

How do I make a strided copy from global to local memory?

I want to copy some data from a buffer in the global device memory to the local memory of a processing core - but, with a twist.
I know about async_work_group_copy, and it's nice (or rather, it's clunky and annoying, but it works). However, my data is not contiguous: it is strided, i.e. there might be X bytes between every two consecutive Y bytes I want to copy.
Obviously I'm not going to copy all the useless data - and it might not even fit in my local memory. What can I do instead? I want to avoid writing actual kernel code to do the copying, e.g.
threadId = get_local_id(0);
if (threadId < length) {
    unsigned offset = threadId * stride;
    localData[threadId] = globalData[offset];
}
You can use the async_work_group_strided_copy() OpenCL API call.
Here is a small example in pyopencl, thanks to @DarkZeros' comment. Let's assume a small stripe of an RGB image, say 4 by 1, like this:
img = np.array([58, 83, 39, 157, 190, 199, 64, 61, 5, 214, 141, 6])
and you want to access the four red channels, i.e. [58 157 64 214]. You'd do:
def test_asyc_copy_stride_to_local(self):
    # Create context, queue, program first
    ....
    # number of R channels
    nb_of_el = 4
    img = np.array([58, 83, 39, 157, 190, 199, 64, 61, 5, 214, 141, 6])
    cl_input = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=img)
    # buffer used to check if the copy is correct
    cl_output = cl.Buffer(ctx, mf.WRITE_ONLY, size=nb_of_el * np.dtype('int32').itemsize)
    lcl_buf = cl.LocalMemory(nb_of_el * np.dtype('int32').itemsize)
    prog.asynCopyToLocalWithStride(queue, (nb_of_el,), None, cl_input, cl_output, lcl_buf)
    result = np.zeros(nb_of_el, dtype=np.int32)
    cl.enqueue_copy(queue, result, cl_output).wait()
    print result
The kernel:
kernel void asynCopyToLocalWithStride(global int *in, global int *out, local int *localBuf) {
    const int idx = get_global_id(0);
    localBuf[idx] = 0;
    // copy 4 elements, the stride = 3 (RGB)
    event_t ev = async_work_group_strided_copy(localBuf, in, 4, 3, 0);
    wait_group_events(1, &ev);
    out[idx] = localBuf[idx];
}
