I'm new to Rust and tried to "translate" a chess engine I made in Python into Rust. Since Rust is a lower-level programming language than Python, I care about the memory layout of my program, but I don't know how to write my code so that the chess board is stored as efficiently as possible in memory.
My first approach was a struct with a 2-dimensional array (8x8) to represent the chess board. The items are Piece structs, which contain information about the piece type (an enum) and the color (an enum). My code for this is:
#[derive(Debug, Copy, Clone)]
pub enum PieceTypes {
    Pawn,
    Knight,
    Bishop,
    Rook,
    Queen,
    King,
}

#[derive(Debug, Copy, Clone)]
pub enum Colors {
    White,
    Black,
}

#[derive(Debug, Copy, Clone)]
pub struct Piece {
    pub piece_typ: PieceTypes,
    pub color: Colors,
}

pub struct Board {
    pub squares: [[Option<Piece>; 8]; 8],
}
But the size of an initialized Board struct in memory is 128 bytes: 16 bytes per row, 2 bytes per square. This is, of course, perfectly fine, but 6 piece types in 2 colors plus an empty square (13 possibilities) definitely don't require 2 bytes (65536 possibilities). Probably this is because the piece type and the color are stored in separate bytes.
As the beginner I am, I wonder how to lay out my code so that one square uses one byte in memory. (Also: because I am, as mentioned, quite a noob in Rust, explicit code examples would be really helpful.)
You can use my library superbitty instead of manual bitfields:
#[derive(Debug, Copy, Clone, superbitty::BitFieldCompatible)]
pub enum PieceTypes {
    None,
    Pawn,
    Knight,
    Bishop,
    Rook,
    Queen,
    King,
}

#[derive(Debug, Copy, Clone, superbitty::BitFieldCompatible)]
pub enum Colors {
    White,
    Black,
}

superbitty::bitfields! {
    #[derive(Debug, Copy, Clone)]
    pub struct Piece : u8 {
        pub piece_typ: PieceTypes,
        pub color: Colors,
    }
}

#[derive(Debug)]
pub struct Board {
    pub squares: [[Piece; 8]; 8],
}

fn main() {
    assert_eq!(std::mem::size_of::<Board>(), 64);
    let mut board = Board {
        squares: [[Piece::new(PieceTypes::None, Colors::White); 8]; 8],
    };
    board.squares[0][0].set_piece_typ(PieceTypes::Queen);
    board.squares[0][1] = Piece::new(PieceTypes::King, Colors::Black);
    dbg!(board.squares[0][1].piece_typ(), board.squares[0][1].color());
    dbg!(board);
}
However, I'd recommend that you not bother unless you have benchmarked and can prove a performance gain. Packing like this can actually hurt your performance.
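If you are curious what manual packing would involve, here is a minimal sketch that stores each square in a single byte, reusing the PieceTypes and Colors enums from your original code (without a None variant). The bit layout here (three bits for the piece type, one bit for the color) is my own choice for illustration, not what superbitty generates:

// One byte per square: bits 0-2 hold the piece type (0 = empty, 1-6 = Pawn..King),
// bit 3 holds the color. This encoding is an assumption made for this sketch.
#[derive(Debug, Copy, Clone, PartialEq, Eq)]
pub struct Square(u8);

impl Square {
    pub const EMPTY: Square = Square(0);

    pub fn new(piece_typ: PieceTypes, color: Colors) -> Square {
        // PieceTypes::Pawn..King have discriminants 0..=5, so add 1 to keep 0 free for "empty".
        Square((piece_typ as u8 + 1) | ((color as u8) << 3))
    }

    pub fn piece_typ(self) -> Option<PieceTypes> {
        match self.0 & 0b0111 {
            0 => None,
            1 => Some(PieceTypes::Pawn),
            2 => Some(PieceTypes::Knight),
            3 => Some(PieceTypes::Bishop),
            4 => Some(PieceTypes::Rook),
            5 => Some(PieceTypes::Queen),
            _ => Some(PieceTypes::King),
        }
    }

    pub fn color(self) -> Colors {
        if (self.0 & 0b1000) == 0 { Colors::White } else { Colors::Black }
    }
}

pub struct PackedBoard {
    pub squares: [[Square; 8]; 8], // 64 bytes: one byte per square
}

With this layout, std::mem::size_of::<PackedBoard>() is 64, the same as the bitfield version above, at the cost of writing the encoding and decoding by hand.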
I have this code:
struct Point {
    pub x: f64,
    pub y: f64,
    pub z: f64,
}

fn main() {
    let p = Point {
        x: 1.0,
        y: 2.0,
        z: 3.0,
    };
    println!("{:p}", &p);
    println!("{:p}", &p.x); // To make sure I'm seeing the struct address and not the variable address. </paranoid>
    let b = p;
    println!("{:p}", &b);
}
Possible output:
0x7ffe631ffc28
0x7ffe631ffc28
0x7ffe631ffc90
I'm trying to understand what happens when doing let b = p. I know that, if p holds a primitive type or any type implementing the Copy or Clone traits, the value or struct is copied into the new variable. In this case, I have not implemented either of those traits for the Point struct, so I expected that b would take ownership of the struct and no copy would be made.
How is it possible for p and b to have different memory addresses? Is the struct moved from one address to another? Is it implicitly copied? Wouldn't it be more efficient to just make b own the data that was already allocated when creating the struct, and therefore keep the same address?
You are experiencing the observer effect: by taking a pointer to these fields (which happens when you format a reference with {:p}) you have caused both the compiler and the optimizer to alter their behavior. You changed the outcome by measuring it!
Taking a pointer to something requires that it be in addressable memory somewhere, which means the compiler couldn't put b or p in CPU registers (where it prefers to put stuff when possible, because registers are fast). We haven't even gotten to the optimization stage but we've already affected decisions the compiler has made about where the data needs to be -- that's a big deal that limits what the optimizer can do.
Now the optimizer has to figure out whether the move can be elided. Taking pointers to b and p could prevent the optimizer from doing so, but it may not. It's also possible that you're just compiling without optimizations.
Note that even if Point were Copy and you removed all of the pointer printing, the optimizer may still elide the copy if it can prove that p is unused after the copy, or that neither value is mutated (which is a pretty good bet, since neither is declared mut).
Here's the rule: don't ever try to determine what the compiler or optimizer does with your code from within that code -- doing so may actually subvert the optimizer and lead you to a wrong conclusion. This applies to every language, not just Rust.
The only valid approach is to look at the generated assembly.
So let's do that!
I used your code as a starting point and wrote two different functions, one with the move and one without:
#![feature(bench_black_box)]

struct Point {
    pub x: f64,
    pub y: f64,
    pub z: f64,
}

#[inline(never)]
fn a() {
    let p = Point {
        x: 1.0,
        y: 2.0,
        z: 3.0,
    };
    std::hint::black_box(p);
}

#[inline(never)]
fn b() {
    let p = Point {
        x: 1.0,
        y: 2.0,
        z: 3.0,
    };
    let b = p;
    std::hint::black_box(b);
}

fn main() {
    a();
    b();
}
A few things to point out before we move on to look at the assembly:
std::hint::black_box() is an experimental function whose purpose is to act as a, well, black box to the optimizer. The optimizer is not allowed to look into this function to see what it does, therefore it cannot optimize it away. Without this, the optimizer would look at the body of the function and correctly conclude that it doesn't do anything at all, and eliminate the whole thing as a no-op.
We mark the two functions as #[inline(never)] to ensure that the optimizer won't inline both functions into main(). This makes them easier to compare to each other.
So we should get two functions out of this and we can compare their assembly.
But we don't get two functions.
In the generated assembly, b() is nowhere to be found. So what happened instead? Let's look to see what main() does:
pushq %rax
callq playground::a
popq %rax
jmp playground::a
Well... would you look at that. The optimizer figured out that the two functions are semantically equivalent, despite one of them having an additional move. So it decided to completely eliminate b() and make it an alias for a(), resulting in two calls to a()!
Out of curiosity, I changed the literal f64 values in b() to prevent the functions from being unified and saw what I expected to see: other than the different values, the emitted assembly was identical. The compiler elided the move.
(Playground -- note that you need to manually press the three-dots button next to "run" and select the "ASM" option.)
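For concreteness, the tweak described above could look something like this (the replacement literals are arbitrary; they only need to differ from the ones in a() so the two functions can no longer be merged):

#[inline(never)]
fn b() {
    // Arbitrary constants that differ from a(), so the optimizer cannot unify
    // the two functions; the move from p into b is still elided.
    let p = Point {
        x: 4.0,
        y: 5.0,
        z: 6.0,
    };
    let b = p;
    std::hint::black_box(b);
}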
I'm working on a drawing engine using Metal. I am reworking from a previous version, so I'm starting from scratch.
I was getting the error "Execution of the command buffer was aborted due to an error during execution. Caused GPU Hang Error (IOAF code 3)".
After some debugging I placed the blame on my drawPrimitives routine, and I found the case quite interesting.
I will have a variety of brushes, and all of them will work with specific vertex info.
So I said, why not have all the brushes respond to a protocol?
The protocol for the Vertices will be this:
protocol MetalVertice {}
And the Vertex info used by this specific brush will be:
struct PointVertex: MetalVertice {
    var pointId: UInt32
    let relativePosition: UInt32
}
The brush can be called either by passing previously created vertices or by calling a function to create those vertices. Anyway, the real drawing happens in the vertex function:
    var vertices: [PointVertex] = [PointVertex].init(repeating: PointVertex(pointId: 0,
                                                                            relativePosition: 0),
                                                     count: totalVertices)

    for (verticeIdx, pointIndex) in pointsIndices.enumerated() {
        vertices[verticeIdx].pointId = UInt32(pointIndex)
    }

    for vertice in vertices {
        print("size: \(MemoryLayout.size(ofValue: vertice))")
    }

    self.renderVertices(vertices: vertices,
                        forStroke: stroke,
                        inDrawing: drawing,
                        commandEncoder: commandEncoder)
    return vertices
}
func renderVertices(vertices: [MetalVertice], forStroke stroke: LFStroke, inDrawing drawing: LFDrawing, commandEncoder: MTLRenderCommandEncoder) {
    if vertices.count > 1 {
        print("vertices a escribir: \(vertices.count)")
        print("stride: \(MemoryLayout<PointVertex>.stride)")
        print("size of array \(MemoryLayout.size(ofValue: vertices))")
        for vertice in vertices {
            print("ispointvertex: \(vertice is PointVertex)")
            print("size: \(MemoryLayout.size(ofValue: vertice))")
        }
    }

    let vertexBuffer = LFDrawing.device.makeBuffer(bytes: vertices,
                                                   length: MemoryLayout<PointVertex>.stride * vertices.count,
                                                   options: [])
This was the issue: calling this specific code produces these results in the console:
size: 8
size: 8
vertices a escribir: 2
stride: 8
size of array 8
ispointvertex: true
size: 40
ispointvertex: true
size: 40
In the previous function, the size of the vertices is 8 bytes, but for some reason, when they enter the next function they turn into 40 bytes, so the buffer is incorrectly constructed.
If I change the function signature to:
func renderVertices(vertices: [PointVertex], forStroke stroke: LFStroke, inDrawing drawing:LFDrawing, commandEncoder: MTLRenderCommandEncoder) {
the vertices are correctly reported as 8 bytes long and the draw routine works as intended.
Is there anything I'm missing? Is the MetalVertice protocol introducing some noise?
In order to fulfill the requirement that value types conforming to protocols be able to perform dynamic dispatch (and also in part to ensure that containers of protocol types are able to assume that all of their elements are of uniform size), Swift uses what are called existential containers to hold the data of protocol-conforming value types alongside metadata that points to the concrete implementations of each protocol. If you've heard the term protocol witness table, that's what's getting in your way here.
The particulars of this are beyond the scope of this answer, but you can check out this video and this post for more info.
The moral of the story is: don't assume that Swift will lay out your structs as written. Swift can reorder struct members and add padding or arbitrary metadata, and it gives you practically no control over this. Instead, declare the structs you need to use in your Metal code in a C or Objective-C file and import them via a bridging header. If you want to use protocols to make it easier to address your structs polymorphically, you need to be prepared to copy them member-wise into your regular old C structs, and be prepared to pay the memory cost that that convenience entails.
I'm trying to improve the performance of a Rust program, which requires me to reduce the size of some large enums. For example, this program:
enum EE {
    A,         // 0
    B(i32),    // 4
    C(i64),    // 8
    D(String), // 24
    E {        // 16
        x: i64,
        y: i32,
    },
}

fn main() {
    println!("{}", std::mem::size_of::<EE>()); // 32
}
prints 32. But if I want to know the size of EE::A, I get a compile error:
error[E0573]: expected type, found variant `EE::A`
  --> src/main.rs:14:40
   |
14 |     println!("{}", std::mem::size_of::<EE::A>());
   |                                        ^^^^^
   |                                        |
   |                                        not a type
   |                                        help: try using the variant's enum: `crate::EE`

error: aborting due to previous error

error: could not compile `play_rust`.
Is there a way to find out which variant takes the most space?
No, there is no way to get the size of just one variant of an enum. The best you can do is get the size of what the variant contains, as if it were a standalone struct:
println!("sizeof EE::A: {}", std::mem::size_of::<()>()); // 0
println!("sizeof EE::B: {}", std::mem::size_of::<i32>()); // 4
println!("sizeof EE::C: {}", std::mem::size_of::<i64>()); // 8
println!("sizeof EE::D: {}", std::mem::size_of::<String>()); // 24
println!("sizeof EE::E: {}", std::mem::size_of::<(i64, i32)>()); // 16
Even this isn't especially useful because it includes padding bytes that may be used to store the tag; as you point out, the size of the enum can be reduced to 16 if D is shrunk to a single pointer, but you can't know that from looking at just the sizes. If y were instead defined as i64, the size of each variant would be the same, but the size of the enum would need to be 24. Alignment is another confounding factor that makes the size of an enum more complex than just "the size of the largest variant plus the tag".
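To illustrate that point, here is a quick check with D boxed so that it stores a single pointer (I've renamed the enum to keep it distinct from the original; the exact number assumes a typical 64-bit target and the compiler's current layout choices, so treat it as illustrative rather than guaranteed):

enum SmallEE {
    A,
    B(i32),
    C(i64),
    D(Box<String>), // a single 8-byte pointer instead of a 24-byte String
    E { x: i64, y: i32 },
}

fn main() {
    // Prints 16 on a typical 64-bit target: the discriminant now fits
    // alongside the largest remaining payload.
    println!("{}", std::mem::size_of::<SmallEE>());
}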
Of course, this is all highly platform-dependent, and your code should not rely on any enum having a particular layout (unless you can guarantee it with a #[repr] annotation).
If you have a particular enum you're worried about, it's not difficult to get the size of each contained type. Clippy also has a lint for enums with extreme size differences between variants. However, I don't recommend using size alone to make manual optimizations to enum layouts, or boxing things that are only a few pointers in size -- indirection suppresses other kinds of optimizations the compiler may be able to do. If you prioritize minimal space usage you may accidentally make your code much slower in the process.
I'm looking for an efficient way to extract the lower 64-bit integer from an __m128i on AMD Piledriver. Something like this:
static inline int64_t extractlo_64(__m128i x)
{
    int64_t result;
    // extract into result
    return result;
}
Instruction tables say that the common approach - using _mm_extract_epi64() - is inefficient on this processor. It generates a PEXTRQ instruction, which has a latency of 10 cycles (compared to 2-3 cycles on Intel processors).
Is there any better way to do this?
On x86-64 you can use _mm_cvtsi128_si64, which translates to a single MOVQ r64, xmm instruction.
One possibility might be to use MOVDQ2Q, which has a latency of 2 cycles on Piledriver:
static inline int64_t extractlo_64(const __m128i v)
{
    return _m_to_int64(_mm_movepi64_pi64(v)); // MOVDQ2Q + MOVQ
}
I am testing alignment and I noticed something strange with the iOS simulator (Xcode 4.3.2 and Xcode 4.5).
On the iOS simulator, structures are aligned to an 8-byte boundary even when __attribute__((aligned(4))) is used to force a 4-byte boundary. Note in the dump below that the struct is padded with 0x00000001 at the end to reach an 8-byte boundary.
If the myStruct variable is defined at global scope, the simulator aligns it to a 4-byte boundary, so it may be something related to the stack.
The simulator is i386, so it is 32-bit and should be aligning to a 4-byte boundary. So what would be the reason? Why is it aligning to an 8-byte boundary? Is it a feature or a bug?
(I know it is not necessary to struggle with the simulator, but it may lead to getting stuck on subtle problems.)
typedef struct myStruct
{
    int a;
    int b;
} myStruct;
//} __attribute__ ((aligned (4))) myStruct;
-(void)alignmentTest
{
    // Offset 16*n (0x2fdfe2f0)
    int __attribute__ ((aligned (16))) force16ByteBoundary = 0x01020304;

    // Offset 16*n-4 (0x2fdfe2ec)
    int some4Byte = 0x09080706;

    // Offset 16*n-12 (0x2fdfe2e4)
    myStruct mys;
    mys.a = 0xa1b1c1d1;
    mys.b = 0xf2e28292;

    NSLog(@"&force16ByteBoundary: %p / &some4Byte: %p / &mys: %p",
          &force16ByteBoundary, &some4Byte, &mys);
}
(EDIT: Optimizations are off, -O0.)
Simulator (iOS 5.1) results:
(lldb) x `&mys` -fx
0xbfffda60: 0xa1b1c1d1 0xf2e28292 0x00000001 0x09080706
0xbfffda70: 0x01020304
&force16ByteBoundary: 0xbfffda70 / &some4Byte: 0xbfffda6c / &mys:
0xbfffda60
Device (iOS 5.1) results:
(lldb) x `&mys` -fx
0x2fdfe2e4: 0xa1b1c1d1 0xf2e28292 0x09080706 0x01020304
&force16ByteBoundary: 0x2fdfe2f0 / &some4Byte: 0x2fdfe2ec / &mys:
0x2fdfe2e4
(NEW FINDINGS)
- On Simulator and Device:
  - Building for Release or Debug does not make any difference to alignment.
  - Local or global variables of long long and double types are aligned to an 8-byte boundary, although they should be aligned to a 4-byte boundary.
  - There is no problem with global variables of structs.
- On Simulator:
  - Local variables of structs are aligned to an 8-byte boundary even when the struct has only a single char member.
(EDIT)
I could only find the "Data Types and Data Alignment" documentation for iOS here.
(Also, they could be inferred from ILP32 alignments here.)
Typically, the alignment attributes only affect the relative alignment of items inside a struct. This allows backward compatibility with code that wants to bulk-copy data into a structure directly from the network or a binary file.
The alignment attributes won't affect the alignment of local variables allocated on the stack. The ordering and alignment of items on the stack is not guaranteed and will generally be chosen optimally for each item on the device. So, if a 386-based device can fetch a 64-bit long long from memory in a single operation by 8-byte aligning it, it will do so. Some processors actually lose dramatic amounts of performance if data is not fully aligned, and some processors can throw exceptions when attempting to read data that is not properly aligned.