What are the performance implications of these C# features?

I have been designing a component-based game library, with the overall intention of writing it in C++ (as that is my forte), with Ogre3D as the back-end. Now that I am actually ready to write some code, I thought it would be far quicker to test out my framework under the XNA4.0 framework (somewhat quicker to get results/write an editor, etc). However, whilst I am no newcomer to C++ or C#, I am a bit of a newcomer when it comes to doing things the "XNA" way, so to speak, so I had a few queries before I started hammering out code:
I read about using arrays rather than collections to avoid performance hits, then also read that this was not entirely true and that if you enumerated over, say, a concrete List<> collection (as opposed to an IEnumerable<>), the enumerator is a value-type that is used for each iteration and that there aren't any GC worries here. The article in question was back in 2007. Does this hold true, or do you experienced XNA developers have real-world gotchas about this? Ideally I'd like to go down a chosen route before I do too much.
If arrays truly are the way to go, no questions asked, I assume when it comes to resizing the array, you copy the old one over with new space? Or is this off the mark? Do you attempt to never, ever resize an array? Won't the GC kick in for the old one if this is the case, or is the hit inconsequential?
As the engine was designed for C++, the design allows for use of lambdas and delegates. One design uses the fastdelegate library which is the fastest possible way of using delegates in C++. A more flexible, but slightly slower approach (though hardly noticeable in the world of C++) is to use C++0x lambdas and std::function. Ideally, I'd like to do something similar in XNA, and allow delegates to be used. Does the use of delegates cause any significant issues with regard to performance?
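For clarity, the std::function approach I have in mind looks roughly like this (a minimal sketch; the onCollision name is just for illustration):

#include <functional>
#include <iostream>

// A delegate-like callback slot that can hold any callable with this signature.
std::function<void(int, int)> onCollision;

int main()
{
    // Bind a C++11 lambda to the slot.
    onCollision = [](int a, int b) {
        std::cout << "collision between " << a << " and " << b << '\n';
    };

    if (onCollision)       // an empty std::function converts to false
        onCollision(1, 2); // invoke the delegate
}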
If there are performance considerations with regards to delegates, is there a difference between:
public delegate void myDelegate(int a, int b);
private void myFunction(int a, int b)
{
}
event myDelegate myEvent;
myEvent += myFunction;
vs:
public delegate void myDelegate(int a, int b);
event myDelegate myEvent;
myEvent += (int a, int b) => { /* ... */ };
Sorry if I have waffled on a bit, I prefer to be clear in my questions. :)
Thanks in advance!

Basically the only major performance issue to be aware of in C# that is different to what you have to be aware of in C++, is the garbage collector. Simply don't allocate memory during your main game loop and you'll be fine. Here is a blog post that goes into detail.
Now to your questions:
1) If a framework collection iterator could be implemented as a value-type (not creating garbage), then it usually (always?) has been. You can safely use foreach on, for example, List<>.
You can verify if you are allocating in your main loop by using the CLR Profiler.
2) Use Lists instead of arrays. They'll handle the resizing for you. You should use the Capacity property to pre-allocate enough space before you start gameplay to avoid GC issues. Using arrays you'd just have to implement all this functionality yourself - ugly!
The GC kicks in on allocations (not when memory becomes free). On Xbox 360 it kicks in for every 1MB allocated and is very slow. On Windows it is a bit more complicated - but also doesn't have such a huge impact on performance.
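As a rough analogy for a C++ developer (this is not XNA code, just the same pre-allocation idea): std::vector's reserve() plays the role of List<>.Capacity, keeping later additions from triggering a reallocation.

#include <vector>

struct Bullet { float x, y, vx, vy; };

int main()
{
    std::vector<Bullet> bullets;
    bullets.reserve(1024); // pre-allocate up front, like setting List<>.Capacity

    // These additions stay within the reserved block, so no reallocation
    // (and no copying of the old storage) happens until capacity is exceeded.
    for (int i = 0; i < 1000; ++i)
        bullets.push_back(Bullet{0.0f, 0.0f, 1.0f, 1.0f});
}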
3) C# delegates are pretty damn fast, and faster than most people expect. They are about on par with method calls on interfaces. Here and here are questions that provide more details about delegate performance in C#.
I couldn't say how they compare to the C++ options. You'd have to measure it.
4) No. I'm fairly sure this code will produce identical IL. You could disassemble it and check, or profile it, though.
I might add - without checking myself - I suspect that having an event myDelegate will be slower than a plain myDelegate if you don't need all the magic of event.

Related

Would there be a practical application for a more memory efficient boolean?

I've noticed that booleans occupy a whole byte, despite only needing 1 bit. I was wondering whether we could have something like
struct smartbool { char data; };
, which would store 8 booleans at once.
I am aware that it would take more time to retrieve the data, but would the tradeoff be practical in some scenarios?
Am I missing something about the memory usage of booleans?
Normally variables are aligned on word boundaries; memory use is balanced against efficiency of access. For one-off boolean variables it may not make sense to store them in a denser form.
If you do need a bunch of booleans you can use things like this BitSet data structure: https://docs.oracle.com/en/java/javase/12/docs/api/java.base/java/util/BitSet.html.
There is a type of database index that stores booleans efficiently:
https://en.wikipedia.org/wiki/Bitmap_index. The less space an index takes up, the easier it is to keep in memory.
There are already widely used data types that support multiple booleans: they are called integers. You can store and retrieve multiple booleans in an integral type using bitwise operations, screening out the bits you don't care about with a pattern of bits called a bitmask.
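A minimal C++ sketch of that masking idea (the bit positions are arbitrary):

#include <cstdint>
#include <iostream>

int main()
{
    std::uint8_t flags = 0;        // 8 booleans packed into one byte

    flags |= 1u << 3;              // set bit 3 (true)
    flags &= ~(1u << 3);           // clear bit 3 (false)
    flags ^= 1u << 5;              // toggle bit 5
    bool bit5 = (flags >> 5) & 1u; // test bit 5

    std::cout << std::boolalpha << bit5 << '\n'; // prints: true
}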
This sort of "packing" is certainly possible and sometimes useful, as a memory-saving optimization. Many languages and libraries provide a way to make it convenient, e.g. std::vector<bool> in C++ is meant to be implemented this way.
However, it should be done only when the programmer knows it will happen and specifically wants it. There is a tradeoff in speed: if bits are used, then setting / clearing / testing a specific bool requires first computing a mask with an appropriate shift, and setting or clearing it now requires a read-modify-write instead of just a write.
And there is a more serious issue in multithreaded programs. Languages like C++ promise that different threads can freely modify different objects, including different elements of the same array, without needing synchronization or causing a data race. For instance, if we have
bool a, b; // not atomic
void thread1() { /* reads and writes a */ }
void thread2() { /* reads and writes b */ }
then this is supposed to work fine. But if the compiler made a and b two different bits in the same byte, concurrent accesses to them would be a data race on that byte, and could cause incorrect behavior (e.g. if the read-modify-writes being done by the two threads were interleaved). The only way to make it safe would be to require that both threads use atomic operations for all their accesses, which are typically many times slower. And if the compiler could freely pack bools in this way, then every operation on a potentially shared bool would have to be made atomic, throughout the entire program. That would be prohibitively expensive.
So this is fine if the programmer wants to pack bools to save memory, is willing to take the hit to speed, and can guarantee that they won't be accessed concurrently. But they should be aware that it's happening, and have control over whether it does.
(Indeed, some people feel that having C++ provide this with vector<bool> was a mistake, since programmers have to know that it is a special exception to the otherwise general rule that vector<T> behaves like an array of T, and different elements of the vector can safely be accessed concurrently. Perhaps they should have left vector<bool> to work in the naive way, and given a different name to the packed version, similar to std::bitset.)
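For comparison, a short sketch of std::bitset, where the packing is explicit and opted into by the programmer:

#include <bitset>
#include <iostream>

int main()
{
    std::bitset<8> flags; // 8 bits, packed, zero-initialized

    flags.set(3);   // set bit 3
    flags.reset(3); // clear bit 3
    flags.flip(5);  // toggle bit 5

    std::cout << flags.test(5) << '\n'; // prints: 1
    std::cout << flags.count() << '\n'; // number of set bits: 1
}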

Using something other than a Swift array for mutable fixed-size thread-safe data passed to OpenGL buffer

I am trying to squeeze every bit of efficiency out of my application I am working on.
I have a couple arrays that follow the following conditions:
They are NEVER appended to, I always calculate the index myself
They are allocated once and never change size
It would be nice if they were thread safe as long as it doesn't cost performance
Some hold primitives like floats, or unsigned ints. One of them does hold a class.
Most of these arrays at some point are passed into a glBuffer
Never cleared just overwritten
Some of the arrays' individual elements are changed entirely by =, others are changed by +=
I currently am using Swift native arrays and am allocating them like var arr = [GLfloat](count: 999, repeatedValue: 0); however, I have been reading a lot of documentation and it sounds like Swift arrays are much more abstract than a traditional C-style array. I am not even sure if they are allocated in one block or more like a linked list with bits and pieces thrown all over the place. I believe that by doing the code above you cause it to allocate in a contiguous block, but I'm not sure.
I worry that the abstract nature of Swift arrays is something that is wasting a lot of precious processing time. As you can see from my conditions above, I don't need any of the fancy appending or safety features of Swift arrays. I just need it simple and fast.
My question is: In this scenario should I be using some other form of array? NSArray, somehow get a C-style array going, create my own data type?
I'm looking into thread safety; would a different array type that was more thread safe, such as NSArray, be any slower?
Note that your requirements are contradictory, particularly #2 and #7. You can't operate on them with += and also say they will never change size. "I always calculate the index myself" also doesn't make sense. What else would calculate it? The requirements for things you will hand to glBuffer are radically different than the requirements for things that will hold objects.
If you construct the Array the way you say, you'll get contiguous memory. If you want to be absolutely certain that you have contiguous memory, use a ContiguousArray (but in the vast majority of cases this will give you little to no benefit while costing you complexity; there appear to be some corner cases in the current compiler that give a small advantage to ContiguousArray, but you must benchmark before assuming that's true). It's not clear what kind of "abstractness" you have in mind, but there are no secrets about how Array works. All of stdlib is open source. Go look and see if it does things you want to avoid.
For certain kinds of operations, it is possible for other types of data structures to be faster. For instance, there are cases where a dispatch_data is better and cases where a regular Data would be better and cases where you should use a ManagedBuffer to gain more control. But in general, unless you deeply know what you're doing, you can easily make things dramatically worse. There is no "is always faster" data structure that works correctly for all the kinds of uses you describe. If there were, that would just be the implementation of Array.
None of this makes sense to pursue until you've built some code and started profiling it in optimized builds to understand what's going on. It is very likely that different uses would be optimized by different kinds of data structures.
It's very strange that you ask whether you should use NSArray, since that would be wildly (orders of magnitude) slower than Array for dealing with very large collections of numbers. You definitely need to experiment with these types a bit to get a sense of their characteristics. NSArray is brilliant and extremely fast for certain problems, but not for that one.
But again, write a little code. Profile it. Look at the generated assembler. See what's happening. Watch particularly for any undesired copying or retain counting. If you see that in a specific case, then you have something to think about changing data structures over. But there's no "use this to go fast." All the trade-offs to achieve that in the general case are already in Array.

Instantiation within loop - primitives and objects

To make this language agnostic let's pseudo code something along the lines of:
for (int i = 0; i <= N; i++) {
    double d = 0;
    userDefinedObject o = new userDefinedObject();
    // effectively do something useful
    o.destroy();
}
Now, this may get into deeper details between Java/C++/Python etc, but:
1 - Is doing this with primitives wrong, or just sort of ugly/overkill (d could be defined above and set to 0 in each iteration if need be)?
2 - Is doing this with an object actually wrong? Now, I know Java will take care of the memory, but for C++ let's assume we have a proper destructor that we call.
Now - the question is quite succinct - is this wrong or just a matter of taste?
Thank you.
Java's garbage collector will take care of any allocated memory that no longer has a reference, which means that if you instantiate on each iteration, you will allocate new memory and lose the reference to the previous one. That said, you can conclude that the GC will take care of the non-referenced memory, BUT you also have to consider that memory allocation, and specifically object initialization, takes time and processing. If you do this in a small program, you're probably not going to feel anything wrong. But say you're working with something like Bitmap: the allocation will totally own your memory.
For both cases, I'd say it is a matter of taste, but in a real-life project you should be totally sure that you need to initialize within a loop.
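In C++ the loop-local object is actually the idiomatic choice when the type manages its resources with RAII, since the destructor runs at the end of every iteration with no garbage collector involved. A hedged sketch, with Widget standing in for some hypothetical user-defined type:

#include <vector>

struct Widget {
    std::vector<int> scratch;  // owns heap memory
    Widget() : scratch(64) {}  // constructor allocates
    // the destructor releases it automatically at the closing brace
};

void perIteration(int N)
{
    for (int i = 0; i <= N; i++) {
        double d = 0;  // a primitive: effectively free, no allocation at all
        Widget o;      // constructed here, destroyed at the end of this iteration
        // ... do something useful with d and o ...
    }
}

void hoisted(int N)
{
    Widget o;  // one allocation, reused on every pass
    for (int i = 0; i <= N; i++) {
        double d = 0;
        // ... reuse o, resetting whatever state matters between iterations ...
    }
}

The hoisted variant trades the repeated allocation for the burden of resetting state by hand, which is the same trade-off described above for Java.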

high performance buffers in objective-c

I'm wondering what the most applicable kind of buffer implementation is for audio data in Objective-C. I'm working with audio data on the iPhone, where I do some direct data manipulation/DSP of the audio data while recording or playing, so performance matters. I have been doing iPhone development for some months now. Currently I'm dealing with C arrays of element type SInt16 or Float32, but I'm looking for something better.
AFAIK, the performance of pointer-iterated c-arrays is unbeatable in an objective-c environment. However, pointer arithmetic and c-arrays are error prone. You always have to make sure that you do not access the arrays out of their bounds. You will not get a runtime error immediately if you do. And you have to make sure manually that you alloc and dealloc the arrays correctly.
Thus, I'm looking for alternatives. What high performance alternatives are there? Is there anything in objective-c similar to the c++ style std::vector?
With similar I mean:
good performance
iteratable with pointer-/iterator-based loop
no overhead of boxing/unboxing basic data types like Float32 or SInt16 into objective-c objects (btw, what's the correct word for 'basic data types' in objective-c?)
bounds-checking
possibility to copy/read/write chunks of other lists or arrays into and out of my searched-for list implementation
memory management included
I've searched and read quite a bit, and of course NSData and NSMutableArray are among the mentioned solutions. However, don't they double the processing cost because of the overhead of boxing/unboxing basic data types? That the code looks outright ugly, with a simple 'set' operation becoming some dinosaur named replaceObjectAtIndex:withObject:, isn't my concern, but it still subtly makes me think that this class is not made for me.
NSMutableData hits one of your requirements in that it brings Objective-C memory management semantics to plain C buffers. You can do something like this:
NSMutableData* data = [NSMutableData dataWithLength: sizeof(Float32) * numberOfFloats];
Float32* cFloatArray = (Float32*)[data mutableBytes];
And you can then treat cFloatArray as a standard C array and use pointer iteration. When the NSMutableData object is dealloc'ed the memory backing it will be freed. It doesn't give you bounds checking, but it delivers memory management help while preserving the performance of C arrays.
Also, if you want some help from the tools in ironing out bounds-checking issues read up on Xcode's Malloc Scribble, Malloc Guard Edges and Guard Malloc options. These will make the runtime much more sensitive to bounds problems. Not useful in production, but can be helpful in ironing out issues during development.
The containers provided in the Foundation framework have little to offer for audio processing: they are on the whole rather heavyweight, and they do not provide extrinsic iterators.
Furthermore, none of the audio APIs in iOS or Mac OS X that interact with buffers of samples are Objective-C based, and none take Foundation framework containers as parameters.
Most likely, you would want to make use of the Accelerate Framework for DSP operations, and its APIs all work on arrays of floats or int16s.
Whilst all of those APIs are C-style, C++ and the STL are the obvious weapon of choice for your requirements, and they interwork cleanly with the rest of an application in the guise of Objective-C++. The STL frequently compiles down to code that is about as efficient as hand-crafted C.
To memory-manage your buffers, perhaps use std::array if you want bounds checking, or std::shared_ptr / std::unique_ptr with a custom deleter if you're not worried.
Places where an iterator is expected - for instance algorithm functions in <algorithm> - can usually also take pointers to basic types - such as your sample buffers.
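A minimal sketch of that approach, assuming Float32 samples (i.e. plain floats) held in a std::vector:

#include <cstddef>
#include <vector>

using Float32 = float; // stand-in for the CoreAudio typedef

int main()
{
    const std::size_t numFrames = 1024;
    std::vector<Float32> buffer(numFrames, 0.0f); // contiguous, zeroed, freed automatically

    // Pointer-based iteration, just like a raw C array:
    for (Float32 *p = buffer.data(); p != buffer.data() + buffer.size(); ++p)
        *p *= 0.5f; // e.g. apply gain

    Float32 sample = buffer.at(42); // bounds-checked access: throws if out of range
    (void)sample;

    // buffer.data() can be passed to any C audio API expecting a Float32*.
}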

How does Objective-C do reference counting efficiently?

I'm taking a college course about compilers and we just finished talking about garbage collection and ways to free memory. However, in class lectures and in our textbook, I was led to believe that reference counting was not a great way to manage memory.
The reasoning was that reference counting is very expensive, because the program has to insert numerous additional instructions to increment and decrement the reference count. Additionally, every time the reference count changes, the program has to check whether it equals zero and, if so, reclaim the memory.
My textbook even has the sentence: "On the whole, the problems with reference counting outweigh its advantages and it is rarely used for automatic storage management in programming language environments."
My questions are: Are these legitimate concerns? Does Objective-C avoid them somehow? If so, how?
Reference counting does have meaningful overhead, it's true. However, the "classic textbook" solution, a tracing garbage collector, is not without downsides either. The biggest one is nondeterminism, but pause times vs. throughput is a significant concern as well.
In the end though, ObjC doesn't really get a choice. A state of the art copying collector requires certain properties of the language (no raw pointers for example) that ObjC just doesn't have. As a result, trying to apply the textbook solution to ObjC ends up requiring a partially conservative, non-copying collector, which in practice is around the same speed as refcounting but without its deterministic behavior.
(edit) My personal feelings are that throughput is a secondary, or even tertiary, concern and that the really important debate comes down to deterministic behavior vs cycle collection and heap compaction by copying. All three of those are such valuable properties that I'd be hard-pressed to pick one.
The consensus on RC vs. tracing in computer science research has been, for a long time, that tracing has superior CPU throughput despite longer (maximum) pause times. (E.g. see here, here, and here.) Only very recently, in 2013, has there been a paper (last link under those three) presenting an RC based system that performs equally or a little better than the best tested tracing GC, with regard to CPU throughput. Needless to say it has no "real" implementations yet.
Here is a tiny benchmark I just did on my iMac with 3.1 GHz i5, in the iOS 7.1 64-bit simulator:
long tenmillion = 10000000;
NSTimeInterval t;
t = [NSDate timeIntervalSinceReferenceDate];
NSMutableArray *arr = [NSMutableArray arrayWithCapacity:tenmillion];
for (long i = 0; i < tenmillion; ++i)
    [arr addObject:[NSObject new]];
NSLog(@"%f seconds: Allocating ten million objects and putting them in an array.", [NSDate timeIntervalSinceReferenceDate] - t);
t = [NSDate timeIntervalSinceReferenceDate];
for (NSObject *obj in arr)
    [self doNothingWith:obj]; // Can't be optimized out because it's a method call.
NSLog(@"%f seconds: Calling a method on an object ten million times.", [NSDate timeIntervalSinceReferenceDate] - t);
t = [NSDate timeIntervalSinceReferenceDate];
NSObject *o;
for (NSObject *obj in arr)
    o = obj;
NSLog(@"%f seconds: Setting a pointer ten million times.", [NSDate timeIntervalSinceReferenceDate] - t);
With ARC disabled (-fno-objc-arc), this gives the following:
2.029345 seconds: Allocating ten million objects and putting them in an array.
0.047976 seconds: Calling a method on an object ten million times.
0.006162 seconds: Setting a pointer ten million times.
With ARC enabled, that becomes:
1.794860 seconds: Allocating ten million objects and putting them in an array.
0.067440 seconds: Calling a method on an object ten million times.
0.788266 seconds: Setting a pointer ten million times.
Apparently allocating objects and calling methods became somewhat cheaper. Assigning to an object pointer became more expensive by orders of magnitude, though don't forget that I didn't call -retain in the non-ARC example, and note that you can use __unsafe_unretained should you ever have a hotspot that assigns object pointers like crazy. Nevertheless, if you want to "forget about" memory management and let ARC insert retain/release calls wherever it wants, you will, in the general case, be wasting lots of CPU cycles, repeatedly and in all code paths that set pointers. A tracing GC, on the other hand, leaves your code itself alone and only kicks in at select moments (usually when allocating something), doing its thing in one fell swoop. (Of course the details are a lot more complicated in truth, given generational GC, incremental GC, concurrent GC, etc.)
So yes, since Objective-C's RC uses atomic retain/release, it is rather expensive, but Objective-C also has many more inefficiencies than those imposed by refcounting. (For instance, the fully dynamic/reflective nature of methods, which can be "swizzled" at any time at run time, prevents the compiler from doing many cross-method optimizations that would require data flow analysis and such. An objc_msgSend() is always a call to a "dynamically linked" black box from the view of the static analyzer, so to speak.) All in all, Objective-C as a language is not exactly the most efficient or most optimizable out there; people call it "C's type safety with Smalltalk's blazing speed" for a reason. ;-)
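The cost of atomic reference-count updates is easy to reproduce outside Objective-C. Here is a rough C++ micro-benchmark sketch in the same spirit, comparing a plain pointer assignment with copying a std::shared_ptr (whose control block is updated atomically). Take the exact numbers with a grain of salt; a serious benchmark needs more care to defeat the optimizer.

#include <chrono>
#include <cstdio>
#include <memory>

int main()
{
    const long iterations = 10000000;
    auto obj = std::make_shared<int>(42);

    auto t0 = std::chrono::steady_clock::now();
    int *raw = nullptr;
    for (long i = 0; i < iterations; ++i)
        raw = obj.get(); // plain pointer copy: a single move instruction
    auto t1 = std::chrono::steady_clock::now();

    std::shared_ptr<int> copy;
    for (long i = 0; i < iterations; ++i)
        copy = obj; // atomic increment of obj's count, atomic decrement of the old one
    auto t2 = std::chrono::steady_clock::now();

    using ns = std::chrono::nanoseconds;
    std::printf("raw:    %lld ns\n", (long long)std::chrono::duration_cast<ns>(t1 - t0).count());
    std::printf("shared: %lld ns\n", (long long)std::chrono::duration_cast<ns>(t2 - t1).count());
    return raw == nullptr; // keep 'raw' observable so the first loop isn't removed entirely
}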
When writing Objective-C, one generally just instruments around well-implemented Apple libraries, which surely use C and C++ and assembly or whatever for their hotspots. Your own code barely ever needs to be efficient. When there is a hot spot, you can make it very efficient by dropping down to lower level constructs like pure C-style code within a single Objective-C method, but one rarely ever needs this. That's why Objective-C can afford the cost of ARC in the general case. I'm not yet convinced that tracing GC has any inherent problems in memory-constrained environments and think one could use a properly high-level language to instrument said libraries just as well, but apparently RC sits better with Apple/iOS. One has to consider the whole of the framework they've built up so far and all their legacy libraries when asking oneself why they didn't go with a tracing GC; for instance I've heard that RC is rather deeply built into CoreFoundation.
On the whole, the problems with reference counting outweigh its advantages and it is rarely used for automatic storage management in programming language environments
The tricky word is automatic
Manual reference counting, which is the traditional Obj-C way to do things, avoids the problems by delegating them to the programmer. The programmer has to know about the reference counting and manually add retain and release calls. If he/she creates a reference cycle, he/she is responsible for solving it.
The modern automatic reference counting does a lot of things for the programmer but still it's not a transparent storage management. The programmer still has to know about reference counting, still has to solve the reference cycles.
What's really tricky is to create a framework which handles memory management by reference counting transparently, that is, without the need for the programmer to know about it. That's why it isn't used for automatic storage management.
The performance loss caused by the additional instructions is not very big, and usually it's not important.