IPP - Intel Integrated Performance Primitives - intel-ipp

Does IPP impose any requirements on the data it handles, such as alignment?

On Windows, with x86 binaries (that's all I've used so far), you don't need to align memory, but the userguide_win_ia32.pdf document has a brief mention of alignment in Chapter 7 (from IPP v6.1):
Memory Alignment
The performance of Intel IPP functions can be significantly different when operating on aligned or misaligned data. Access to memory is faster if pointers to the data are aligned. Use the following Intel IPP functions for pointer alignment, memory allocation and deallocation:
The referenced functions include ippMalloc, ippAlignPtr, ippFree, and so on.
The bottom line is, alignment can help with performance, but you won't know to what extent until you profile with and without alignment.
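As a rough illustration, here is a minimal sketch of the allocation pattern those functions support. It assumes the IPP headers are installed; the buffer size and element type are arbitrary examples:

    #include <ipp.h>

    int main() {
        // ippMalloc returns a suitably aligned buffer (64-byte aligned
        // in recent IPP versions); plain malloc makes no such promise.
        const int len = 1024 * sizeof(Ipp32f);
        Ipp32f* data = (Ipp32f*)ippMalloc(len);
        if (data == NULL) return 1;

        // ... call IPP functions on `data` here ...

        ippFree(data);  // buffers from ippMalloc must be freed with ippFree
        return 0;
    }

Profiling the same workload against an unaligned malloc'd buffer is the only way to know what the alignment is worth for your data sizes.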

Related

Is unaligned memory access allowed on iOS devices?

I'm currently working on an app that loads a blob of tightly packed data containing different integer types (sized from char to int) that might not be properly aligned.
So, can I use a simple *(short*)ptr or similar access on that data? A test on my iPhone 5 shows no problem with that, but I'm not sure about all cases on all newer processors.
I did find some related information, like this:
ARMv6 and later, except some microcontroller versions, support unaligned accesses for half-word and single-word load/store instructions with some limitations, such as no guaranteed atomicity.
but in the case of word accesses it seems that a "word" is 32 bits on 32-bit ARM and 64 bits on 64-bit ARM, which would mean that a short requires proper alignment on a 64-bit machine.
So, can I assume this is safe, or should I use some keywords like __packed?
Or should I rather avoid it completely and recreate my data so it always has proper alignment (or always use memmove when the data comes from an external source and cannot be permanently modified)?
It was ages ago that I tried this. It worked, but every single access to unaligned memory caused a trap, which took considerable time. I'd suggest you measure how long it takes to add a million aligned shorts vs. a million unaligned shorts. If you only have a few hundred or thousand unaligned numbers, there's nothing to worry about.
__packed works reasonably fast. ARM has some clever instructions to do unaligned access with very few instructions. Again, I'd measure how long that takes. My experience with this is not current.
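If measuring shows the casts are a problem, or you want to stay strictly portable, one common alternative to *(short*)ptr and __packed (a sketch, not from the answers above) is to copy through memcpy; on architectures that tolerate unaligned loads, compilers typically lower this to a single load instruction:

    #include <string.h>
    #include <stdint.h>

    // Read a possibly misaligned 16-bit value from a packed blob.
    static inline int16_t read_i16(const void* p) {
        int16_t v;
        memcpy(&v, p, sizeof v);  // well-defined for any alignment
        return v;
    }

The same pattern works for 32- and 64-bit fields, and it sidesteps the undefined behavior that the raw pointer cast has in C and C++ regardless of what the hardware allows.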

DirectX, GDI+ or SDL on Windows XP?

If I want to do scaling and compositing of 2D anti-aliased vector and bitmap images in real-time on Windows XP and later versions of Windows, making the best use of hardware acceleration available, should I be using GDI+ or DirectX 9.0c? (Actually, Windows XP and Windows 7 are important but we're not concerned about performance on Vista.)
Is there any merit in using SDL, given that the application is not cross-platform (and never will be)? I wonder if SDL might make it easier to switch to whichever underlying drawing API gives better performance…
Where can I find the documentation for doing scaling and compositing of 2D images in DirectX 9.0c? (I found the documentation for DirectDraw but read that it is deprecated after DirectX 7. But Direct2D is not available until DirectX 10.)
Can I reasonably expect scaling and compositing to be hardware accelerated on Windows XP on a mid- to low-spec PC (i.e. integrated graphics)? If not then does it even matter whether I use GDI+ or DirectX 9.0c?
Do not use GDI+. It does everything in software, and it has a rendering model that is not good for performance in software. You'd be better off with just about anything else.
Direct3D or OpenGL (which you can access via SDL if you want a more complete API that is cross-platform) will give you the best performance on hardware that supports it. Direct2D is in the same boat but is not available on Windows XP. My understanding is that, at least in the case of Intel's integrated GPUs, the hardware is able to do simple operations like transforming and compositing, and that most of the problems with these GPUs come from games that have high demands for features and performance and are optimized for ATI/Nvidia cards. If you somehow find a machine where Direct3D is not supported by the video card and is falling back to software, then you might have a problem.
I believe SDL uses DirectDraw on Windows for its non-OpenGL drawing. Somehow I got the impression that DirectDraw does all its operations in software in modern releases of Windows (and given what DirectDraw is used for it never really mattered since the win9x era), but I'm not able to verify that.
The ideal would be a cross-platform vector graphics library that can make use of Direct3D or OpenGL for rendering, but AFAICT no such thing is available. The Cairo graphics library lacks acceleration on Windows, and Mozilla has started a project called Azure that apparently has that but doesn't appear to be designed for use outside of their projects.
I just found this: 2D Rendering in DirectX 8.
It appears that since Microsoft removed DirectDraw after DirectX 7 they expected all 2D drawing to be done using the 3D API. This would explain why I totally failed to find the documentation I was looking for.
The article looks promising so far.
Here's another: 2D Programming in a 3D World
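To make the "2D through the 3D API" idea concrete, here is a hedged sketch of drawing a scaled 2D image with D3DX's sprite helper on DirectX 9. The device, texture, and sprite objects are assumed to have been created during normal D3D9 setup; the names are illustrative:

    #include <d3d9.h>
    #include <d3dx9.h>

    // Assumed to exist from the usual setup code:
    extern IDirect3DDevice9*  g_device;
    extern IDirect3DTexture9* g_texture;  // the 2D image to draw
    extern ID3DXSprite*       g_sprite;   // from D3DXCreateSprite(g_device, &g_sprite)

    void DrawFrame() {
        g_device->Clear(0, NULL, D3DCLEAR_TARGET, D3DCOLOR_XRGB(0, 0, 0), 1.0f, 0);
        g_device->BeginScene();

        // ID3DXSprite renders textured quads; scaling and alpha
        // compositing run on the GPU.
        g_sprite->Begin(D3DXSPRITE_ALPHABLEND);
        D3DXMATRIX scale;
        D3DXMatrixScaling(&scale, 2.0f, 2.0f, 1.0f);  // example: 2x zoom
        g_sprite->SetTransform(&scale);
        g_sprite->Draw(g_texture, NULL, NULL, NULL, D3DCOLOR_ARGB(255, 255, 255, 255));
        g_sprite->End();

        g_device->EndScene();
        g_device->Present(NULL, NULL, NULL, NULL);
    }

This uses only the fixed-function pipeline, so it should be hardware accelerated even on XP-era integrated graphics that support Direct3D 9.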

setting up OpenCL with DirectX 9

I'm completely new to OpenCL and GPU programming in general. Right now I am working on a project where I'm trying to measure the performance gains that making use of the GPU in a game can bring. With this, however, I have run into a snag: how do I set up my DirectX project to talk to the OpenCL code base?
I've been googling this for about a week and haven't been able to find anything. If someone could point me in the right direction, I would be grateful.
OpenCL does not have anything to do with DirectX; it's simply another library.
For OpenCL you'll need an implementation ('SDK'), as Khronos doesn't provide one (they only provide the specifications).
Intel, AMD and Nvidia all provide one, but they have different requirements and limitations. See here for some of the existing implementations.
After installing one of these, you'll have the necessary headers and libraries to code against the OpenCL API and link with OpenCL.dll.
There are lots of sample sources in the SDKs or online; you have to write the kernel, and the rest is mostly boilerplate code for initialization and kernel compilation, as in the sketch below.
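Here is a minimal sketch of that boilerplate, assuming an OpenCL 1.x SDK is installed; the kernel and buffer contents are arbitrary examples, and error checking is trimmed for brevity:

    #include <CL/cl.h>
    #include <stdio.h>

    static const char* kSource =
        "__kernel void scale(__global float* data, float factor) {"
        "    size_t i = get_global_id(0);"
        "    data[i] *= factor;"
        "}";

    int main() {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

        // Compile the kernel at runtime from source.
        cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(prog, "scale", NULL);

        // Copy host data to the device, run the kernel, read the result back.
        float host[256];
        for (int i = 0; i < 256; ++i) host[i] = (float)i;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    sizeof(host), host, NULL);
        float factor = 2.0f;
        clSetKernelArg(kernel, 0, sizeof(buf), &buf);
        clSetKernelArg(kernel, 1, sizeof(factor), &factor);

        size_t global = 256;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(host), host, 0, NULL, NULL);
        printf("host[3] = %f\n", host[3]);  // expect 6.0

        clReleaseMemObject(buf); clReleaseKernel(kernel); clReleaseProgram(prog);
        clReleaseCommandQueue(queue); clReleaseContext(ctx);
        return 0;
    }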
The specific OpenCL extension that allows sharing OpenCL buffers as textures and vice versa is cl_khr_d3d10_sharing: http://www.khronos.org/registry/cl/extensions/khr/cl_khr_d3d10_sharing.txt
OpenCL has extensions for sharing memory between DirectX and OpenCL (and also between OpenGL and OpenCL.) This allows you to read or write DirectX buffers, including textures from within OpenCL. Ani's answer mentioned the extension for DirectX 10, but since the question is about DirectX 9, the extension you'll actually be using is cl_khr_dx9_media_sharing.
This extension has just 4 functions:
clGetDeviceIDsFromDX9MediaAdapterKHR
This function gets the OpenCL device IDs of the OpenCL device(s) that can share memory with a given Direct3D 9 device.
clCreateFromDX9MediaSurfaceKHR
This function gets an OpenCL cl_mem memory object for a given Direct3D 9 memory object.
clEnqueueAcquireDX9MediaSurfacesKHR
This function locks the specified shared memory object so that you can read and/or write to it from OpenCL.
clEnqueueReleaseDX9MediaSurfacesKHR
This function unlocks the specified memory object from OpenCL, so that Direct3D can read/write it again.
Once you've used the above functions to share and synchronize access to the memory buffers, everything else on both the Direct3D 9 side and the OpenCL side works as it would otherwise with those particular APIs. A sketch of the overall flow follows.
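Here is a hedged sketch of that acquire/use/release flow, not taken verbatim from the extension spec. It assumes the context was created for a device returned by clGetDeviceIDsFromDX9MediaAdapterKHR, that the extension function pointers were fetched with clGetExtensionFunctionAddressForPlatform, and that the kernel and surface size are placeholders:

    #include <d3d9.h>
    #include <CL/cl.h>
    #include <CL/cl_dx9_media_sharing.h>

    // Extension entry points loaded at runtime:
    extern clCreateFromDX9MediaSurfaceKHR_fn      pCreateFromDX9Surface;
    extern clEnqueueAcquireDX9MediaSurfacesKHR_fn pAcquireDX9Surfaces;
    extern clEnqueueReleaseDX9MediaSurfacesKHR_fn pReleaseDX9Surfaces;

    void ProcessSurface(cl_context ctx, cl_command_queue queue,
                        cl_kernel kernel, IDirect3DSurface9* surface) {
        cl_dx9_surface_info_khr info = { surface, NULL /* share handle */ };
        cl_int err;

        // Wrap the Direct3D 9 surface in a cl_mem image object (plane 0).
        cl_mem img = pCreateFromDX9Surface(ctx, CL_MEM_READ_WRITE,
                                           CL_ADAPTER_D3D9EX_KHR, &info, 0, &err);

        // Lock the surface for OpenCL; Direct3D must be done with it first.
        pAcquireDX9Surfaces(queue, 1, &img, 0, NULL, NULL);

        clSetKernelArg(kernel, 0, sizeof(img), &img);
        size_t global[2] = { 640, 480 };  // placeholder surface size
        clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);

        // Hand the surface back to Direct3D.
        pReleaseDX9Surfaces(queue, 1, &img, 0, NULL, NULL);
        clReleaseMemObject(img);
    }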
Note that your GPU will need to support the cl_khr_dx9_media_sharing extension in order for this to work. You can check the extensions property of the OpenCL platform and device in order to confirm that this extension is supported.
Some Nvidia GPUs support a different extension instead, called cl_nv_d3d9_sharing. The basic idea of how it works is the same as with the cl_khr_dx9_media_sharing extension, but the exact details are a bit different. The biggest difference is just that it has different functions for getting cl_mem objects for different types of Direct3D 9 buffers, rather than just one function to cover all of them.

What language and compiler to write ATI GPU code for?

I know Nvidia has CUDA, but what does ATI have? I don't want to use OpenCL because I want to stay as close to the hardware as possible.
Is it Brook, or Stream?
The documentation available is pretty pathetic! CUDA seems easy to start programming with, but I want to use ATI specifically because of their hardware.
OpenCL is AMD's currently preferred GPU/compute language.
Brook is deprecated.
However, you can write code at a very low level using AMD's Shader Analyzer and Kernel Analyzer:
http://developer.amd.com/tools/shader/Pages/default.aspx
http://developer.amd.com/tools/AMDAPPKernelAnalyzer/Pages/default.aspx
For example, http://developer.amd.com/tools/shader/PublishingImages/GSA.png shows OpenCL code and the Radeon 5870 assembly produced from it.
You can actually code directly in several forms of "assembly".
Or at least you could - the webpages no longer mention this.
(I used to have this installed for tuning and testing, but do not at the moment.)
More usually, you can code in any of several forms of AMD IL (Intermediate Language), which is closer to the machine than OpenCL. The Kernel Analyzer web page says "If your kernel is an IL kernel Stream, KernelAnalyzer will automatically compile the IL..."
I would recommend that you use OpenCL, and then look at the disassembly and tweak the OpenCL code to be better tuned. But you can work in IL, and probably still can work at an even lower level.

Why byte-addressable memory and not 4-byte-addressable memory?

Why do computers have byte-addressable memory, and not 4-byte-addressable memory (or 8-byte-addressable memory for 64-bit)? Yeah, I see how it could be useful sometimes; it just seems inelegant and excessive. Are the advantages substantial, or is it really just because of legacy?
Processors actually do access memory in 64-bit quantities (x86 has done so since the Pentium or so); 64-bit processors often have a 128-bit bus. Plus, accesses to main memory happen in bursts that fill an entire cache line, which is an even larger unit of memory.
It's only the addressing that is byte-based; this adds little overhead and is not excessive at all.
Today, you absolutely need byte-based addressing for networking protocols. Implementing TCP with word-based addressing would be difficult: what would you want read() to return if what you received were 17 bytes? Likewise, higher layers are byte-based: it would be fairly difficult to implement HTTP if a request line like "GET / HTTP/1.0" were presented in units of four bytes. You would essentially have to split the words back into bytes with shift operations and the like (which processors now do in hardware, thanks to byte-based addressing).
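For a feel of what that splitting looks like, here is a small illustration (not from the original answer) of unpacking the bytes of "GET " from a single 32-bit little-endian word with shifts and masks:

    #include <stdint.h>
    #include <stdio.h>

    int main() {
        uint32_t word = 0x20544547;  // "GET " packed little-endian:
                                     // 'G'=0x47, 'E'=0x45, 'T'=0x54, ' '=0x20
        for (int i = 0; i < 4; ++i) {
            char c = (char)((word >> (8 * i)) & 0xFF);  // isolate byte i
            putchar(c);
        }
        putchar('\n');  // prints: GET
        return 0;
    }

With byte addressing, the same bytes are reachable directly, and any shifting happens invisibly inside the memory hierarchy.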
Largely historical reasons - it has become the standard that CPUs understand. Here is a good discussion on it:
Generally, a size has to be chosen to be convenient for both data and machine instructions. 8 bits (256 values) is enough to accommodate common characters in English and some other languages. Designers of 8-bit processors presumably found that being able to encode 256 common instructions as one byte was a "reasonable tradeoff". And at the time, 8 bits was also generally enough to encode other things such as a pixel colour or screen coordinate. Having a byte size that is a power of 2 may also have been felt to be a "neater" design. It is interesting to note that, for example, Marxer, E. (1974), Elements of Data Processing, describes a byte as being either 6-bit or 8-bit depending on whether the computer was of the "octal" or "hexadecimal" type.
Certainly, other sizes were used in the early days.
We needed to settle on some size for standardization. People chose the 8-bit size for the reasons mentioned by Shane above. Since then we have been stuck with byte-addressable memory, and now it is impossible to change due to various compatibility issues and the fact that opcodes are only a byte long. But using a trick, memory can easily be made word-addressable for fetching/storing data and addresses!
