What's the difference between REF and HAL? - directx

The title question isn't formulated exactly right because it's difficult to do in one sentence, so here is the real question.
The function which creates the object representing a device in Direct3D 9 looks like this:
HRESULT IDirect3D9::CreateDevice(
    UINT                  Adapter,
    D3DDEVTYPE            DeviceType,
    HWND                  hFocusWindow,
    DWORD                 BehaviorFlags,
    D3DPRESENT_PARAMETERS *pPresentationParameters,
    IDirect3DDevice9      **ppReturnedDeviceInterface
);
The UINT Adapter argument refers to a particular video card on the target computer, while the DeviceType argument selects either HAL or REF. So what's the point of specifying a particular video card (e.g. 0) together with the REF device type? Isn't REF an abstract device emulated by the CPU that has no relation to any video card?

Basically, you're right. Reference devices implement most DirectX functionality in software and do not rely on the graphics driver. As a result, they are quite slow and should only be used for testing. There are two reasons why you would want a reference device:
If your graphics card does not support the DirectX features you want to use, you can use a reference device because it supports every feature. However, this only makes sense during development (e.g. if you're temporarily on a low-end machine).
If you get strange results, it could be a driver issue. To rule that out, you can check with a reference device. If this gives you the same results, the problem is somewhere in your code. If it gives you correct results, the graphics driver is buggy.
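For illustration, here is a minimal sketch of creating a HAL device on the default adapter and falling back to REF; it assumes a valid IDirect3D9* d3d, a window handle hwnd, and a filled-in D3DPRESENT_PARAMETERS pp (all placeholder names):
// Try the hardware (HAL) device first; fall back to the reference rasterizer
// (REF) for debugging if HAL creation fails.
IDirect3DDevice9 *device = NULL;
HRESULT hr = d3d->CreateDevice(D3DADAPTER_DEFAULT,   // adapter 0
                               D3DDEVTYPE_HAL,        // use the graphics driver
                               hwnd,
                               D3DCREATE_HARDWARE_VERTEXPROCESSING,
                               &pp,
                               &device);
if (FAILED(hr))
{
    // REF runs entirely in software; the adapter index is still needed so
    // Direct3D knows which display/output and modes to present to.
    hr = d3d->CreateDevice(D3DADAPTER_DEFAULT,
                           D3DDEVTYPE_REF,
                           hwnd,
                           D3DCREATE_SOFTWARE_VERTEXPROCESSING,
                           &pp,
                           &device);
}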

Related

STM32 Current Flash Vector Address

I'm working on a dual-OS system with an STM32F103. I have two separate programs that are programmed at different flash locations. If both programs are the same, the only way to know which of them is running is by its start vector address.
But how can I read the current program's start vector address on the STM32?
After reading the comments, it sounds like what you have/want is a bootloader. If your goal here is to have two different applications, one to do your main processing and real time handling and the other to just program new firmware, then you want to make a bootloader in your default boot flash space.
Bootloaders fundamentally do a few things; everything else is extra:
Check themselves using some type of data integrity check, like a CRC.
Check the application.
Jump to the application.
Bootloaders will also program applications into the app space and verify they are programmed correctly before jumping. Colin gave some good advice about appending a CRC to the hex file before it is programmed into flash space to verify the application.
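A minimal sketch of such an integrity check in C (the addresses, the CRC-32 algorithm, and the convention of storing the expected CRC in the last word of the application image are all assumptions for illustration; your memory map and CRC tool define the real layout):
#include <stdint.h>

/* Hypothetical layout: application occupies APP_START .. APP_START+APP_SIZE,
 * with the expected CRC stored in the last 4 bytes of that region. */
#define APP_START  0x08010000UL
#define APP_SIZE   0x00010000UL

static uint32_t crc32_sw(const uint8_t *data, uint32_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *data++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

static int application_is_valid(void)
{
    const uint8_t *app = (const uint8_t *)APP_START;
    uint32_t stored    = *(const uint32_t *)(APP_START + APP_SIZE - 4u);
    return crc32_sw(app, APP_SIZE - 4u) == stored;
}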
There are a few things to look out for. The first is the linker script, and this is extremely important. A linker script maps input objects to output objects and then determines, based on that mapping, what memory space they go into. For both of your applications, you need to create a memory map of how you want the two programs to sit inside the flash space. From this point, you can make linker scripts for both programs so that each hex file is generated within the bounds of what you deem acceptable flash space for that program. Each project you have will have its own linker script. An example (a Keil scatter file) would look something like this:
LR_IROM1 0x08000000 0x00010000 {    ; load region size_region
  ER_IROM1 0x08000000 0x00010000 {  ; load address = execution address
    *.o (RESET, +First)
    *(InRoot$$Sections)
    .ANY (+RO)
  }
  RW_IRAM1 0x20000000 0x00018000 {  ; RW data
    .ANY (+RW +ZI)
  }
}
This will give RAM for the application to use as well as a starting point for the application.
After that, you can start on the bootloader and give it information about where the application space lies for jumping and programming. Once again, this is all determined by you from your memory map and both applications' linker scripts. You are also going to need to add a separate entry inside the linker script for your CRC and length, so the calculated values can be compared against the stored ones. Whatever tool you use to append the CRC to the hex file and have it programmed into flash space, remember to note the location and make it known to the linker script so you can reference those addresses to check integrity later.
After you check everything and it is determined that it is okay to go to the application, you can use some ARM assembly (or a small piece of C, as sketched below) to jump to the starting application address. Before jumping, make sure to disable all peripherals and interrupts that were enabled in the bootloader. As Colin mentioned, the two programs share RAM, so it is important you de-initialize everything that was used; otherwise, you'll end up with a hard fault.
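A minimal sketch of that jump in C, using the CMSIS intrinsics pulled in by the device header; APP_START is the hypothetical application base from the memory map above, and the de-initialization step is left as a comment because it depends on what your bootloader enabled:
#include "stm32f10x.h"   /* device header, brings in the CMSIS core functions */

#define APP_START 0x08010000UL   /* hypothetical application base address */

typedef void (*app_entry_t)(void);

static void jump_to_application(void)
{
    /* Word 0 of the application's vector table is its initial stack pointer,
     * word 1 is the address of its Reset_Handler. */
    uint32_t app_sp    = *(volatile uint32_t *)(APP_START);
    uint32_t app_reset = *(volatile uint32_t *)(APP_START + 4u);

    /* De-initialize everything the bootloader enabled here (clocks back to
     * their reset state, peripherals off, pending interrupts cleared). */

    __set_MSP(app_sp);            /* load the application's stack pointer */
    ((app_entry_t)app_reset)();   /* jump to the application's Reset_Handler */
}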
At this point, the application was built from its own hex file laid out by its own linker script, so it should begin executing as planned, as long as you have the correct vector table offset, which gets to your actual question.
As far as your question on the "flash vector address": I think what you really mean is your interrupt vector table address. An interrupt vector table is a data structure in memory that maps interrupt requests to the addresses of interrupt handlers. This is where the PC register grabs the next instruction address when, for example, a hardware interrupt triggers. You can see this by keeping track of the ARM pipeline through a few lines of assembly code. Each entry in this table is a handler's address.
This offset must be aligned with your application; otherwise you will never reach your main function, and the program will sit in the application space with nothing to do, since all handler addresses are unknown. This is what SCB->VTOR is for: it is the vector table offset register.
In this case, there are a few things you can do. Luckily, these are hard-coded inside the STM-generated file "system_stm32(xx)xx.c" (xx is your microcontroller variant). There is a define called VECT_TAB_OFFSET, which is the offset of the vector table in the memory map and is assigned to the SCB->VTOR register. Your interrupt vector table will always lie at the starting address of your main application, so for the bootloader the offset can be 0x00, but for the application it will be the starting address of the application space minus the first addressable flash address of the microcontroller.
/************************* Miscellaneous Configuration ************************/
/*!< Uncomment the following line if you need to relocate your vector Table in
Internal SRAM. */
/* #define VECT_TAB_SRAM */
#define VECT_TAB_OFFSET 0x00 /*!< Vector Table base offset field.
This value must be a multiple of 0x200. */
/******************************************************************************/
Make sure you understand what is expected on the micro side, using the STM documentation, before programming things. Vector tables on this chip can only be placed at addresses that are multiples of 0x200. But to answer your question: this address is determined by your memory map, and eventually you will have a hard-coded reference to it as a define. You can figure it out from there.
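For example, if the application space starts at the hypothetical 0x08010000 on a chip whose flash begins at 0x08000000, the application's copy of system_stm32(xx)xx.c would use an offset of 0x10000, and SystemInit applies it roughly like this (a sketch of the pattern in the ST-generated file, not a verbatim copy):
#define VECT_TAB_OFFSET  0x00010000   /* application start minus flash base; must be a multiple of 0x200 */

/* inside SystemInit(): point the vector table at the application's own table */
SCB->VTOR = FLASH_BASE | VECT_TAB_OFFSET;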
Hope this helps and good luck to you on your application.

Passing arguments through __local memory in OpenCL

I am confused about __local memory in OpenCL here.
I read some spec saying that the data flow has to be from host to __global, and then to __local.
But I also see some kernel function like this:
__kernel void foo(__local float * a)
I was wondering how the data was transferred directly into the __local
memory in this way?
Thanks.
It is not possible to fill a local buffer from the host side. Therefore you have to follow the flow host -> __global -> __local.
A local buffer can either be created on the host side and then passed as a kernel parameter, or created on the GPU side inside the kernel.
Creating the local buffer on the host side has the advantage that you can decide its size before the kernel is run, which can be important if the local buffer size needs to be different each time the kernel is run.
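A minimal host-side sketch of this (kernel and local_size are assumed to already exist; the data pointer must be NULL for a __local argument, since only the size is specified and no data is transferred):
/* Reserve local_size floats of __local memory for kernel argument 0. */
size_t local_bytes = local_size * sizeof(float);
cl_int err = clSetKernelArg(kernel, 0, local_bytes, NULL);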
Local memory is not visible to anything but a single work-group, and may be allocated as the work-group is dispatched by hardware on many architectures. Hardware that can mix multiple work-groups from different kernels on each CU will allow the scheduling component to chunk up the local memory for each of the groups being issued. It doesn't exist before the group is launched, and does not exist after the group terminates. The size of this region is what you pass in as other answers have pointed out.
The result of this is that the only way on many architectures for filling local memory from the host would be for kernel code to be inserted by the compiler that would copy data in from global memory. Given that as the basis, it isn't any worse in terms of performance for the programmer to do it manually, and gives more control over exactly what happens. You do not end up in a situation where the compiler always generates copy code and ends up copying more than was really necessary because the API didn't make it clear what memory was copy-in and what was not.
In summary, you cannot fill local memory in any automated way. In practice you will rarely want to, because doing it manually gives you the opportunity to only put the result of a first stage into local, removing extra copy operations, or to transform the data on the way in to local, allowing padding or data transposition to remove bank conflicts and so on.
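A minimal kernel sketch of that manual copy, one element per work-item, assuming the work-group size matches the local buffer size (src is a hypothetical global input):
__kernel void foo(__global const float *src, __local float *a)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    /* Each work-item copies one element from global to local memory. */
    a[lid] = src[gid];

    /* Wait until the whole work-group has finished filling the local buffer. */
    barrier(CLK_LOCAL_MEM_FENCE);

    /* ... work on a[] here ... */
}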
As @doqtor said, the size of a local-memory kernel parameter can be specified with clSetKernelArg calls.
Fortunately, OpenCL 1.2+ supports VLAs (variable-length arrays), so a local-memory kernel parameter is not required any more.

How do I choose a pixel format when creating a new Texture2D?

I'm using the SharpDX Toolkit, and I'm trying to create a Texture2D programmatically, so I can manually specify all the pixel values. And I'm not sure what pixel format to create it with.
SharpDX doesn't even document the toolkit's PixelFormat type (they have documentation for another PixelFormat class but it's for WIC, not the toolkit). I did find the DirectX enum it wraps, DXGI_FORMAT, but its documentation doesn't give any useful guidance on how I would choose a format.
I'm used to plain old 32-bit bitmap formats with 8 bits per color channel plus 8-bit alpha, which is plenty good enough for me. So I'm guessing the simplest choices will be R8G8B8A8 or B8G8R8A8. Does it matter which I choose? Will they both be fully supported on all hardware?
And even once I've chosen one of those, I then need to further specify whether it's SInt, SNorm, Typeless, UInt, UNorm, or UNormSRgb. I don't need the sRGB colorspace. I don't understand what Typeless is supposed to be for. UInt seems like the simplest -- just a plain old unsigned byte -- but it turns out it doesn't work; I don't get an error, but my texture won't draw anything to the screen. UNorm works, but there's nothing in the documentation that explains why UInt doesn't. So now I'm paranoid that UNorm might not work on some other video card.
Here's the code I've got, if anyone wants to see it. Download the SharpDX full package, open the SharpDXToolkitSamples project, go to the SpriteBatchAndFont.WinRTXaml project, open the SpriteBatchAndFontGame class, and add code where indicated:
// Add new field to the class:
private Texture2D _newTexture;
// Add at the end of the LoadContent method:
_newTexture = Texture2D.New(GraphicsDevice, 8, 8, PixelFormat.R8G8B8A8.UNorm);
var colorData = new Color[_newTexture.Width * _newTexture.Height];
_newTexture.GetData(colorData);
for (var i = 0; i < colorData.Length; ++i)
    colorData[i] = (i % 3 == 0) ? Color.Red : Color.Transparent;
_newTexture.SetData(colorData);
// Add inside the Draw method, just before the call to spriteBatch.End():
spriteBatch.Draw(_newTexture, new Vector2(0, 0), Color.White);
This draws a small rectangle with diagonal lines in the top left of the screen. It works on the laptop I'm testing it on, but I have no idea how to know whether that means it's going to work everywhere, nor do I have any idea whether it's going to be the most performant.
What pixel format should I use to make sure my app will work on all hardware, and to get the best performance?
The formats in the SharpDX Toolkit map to the underlying DirectX/DXGI formats, so you can, as usual with Microsoft products, get your info from the MSDN:
DXGI_FORMAT enumeration (Windows)
32-bit textures are a common choice for most texture scenarios and perform well even on older hardware. UNorm means, as already answered in the comments, "in the range of 0.0 .. 1.0" and is, again, a common way to access color data in textures.
If you look at the Hardware Support for Direct3D 10Level9 Formats (Windows) page you will see that DXGI_FORMAT_R8G8B8A8_UNORM as well as DXGI_FORMAT_B8G8R8A8_UNORM are supported on DirectX 9 hardware. You will not run into compatibility problems with either of them.
Performance depends on how your Device is initialized (RGBA/BGRA?) and on what hardware (i.e. supported DX feature level) and OS you are running your software on. You will have to run your own tests to find out (though in the case of these common and similar formats the difference should be a single-digit percentage at most).
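If you would rather verify support at runtime than rely on the tables, the underlying Direct3D 11 API can be asked directly; a minimal C++ sketch against the raw API (device is an already-created ID3D11Device*; SharpDX wraps the same CheckFormatSupport call on its Direct3D11 Device):
// Query whether the device supports using R8G8B8A8_UNORM as a 2D texture.
UINT support = 0;
HRESULT hr = device->CheckFormatSupport(DXGI_FORMAT_R8G8B8A8_UNORM, &support);
bool usableAsTexture2D =
    SUCCEEDED(hr) && (support & D3D11_FORMAT_SUPPORT_TEXTURE2D) != 0;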

Why dead code in OpenCL kernel influence result in Nvidia GTX550ti?

I am using Nvidia's OpenCL development software on a GTX 550 Ti graphics card and have encountered a strange problem. (I am a newcomer to OpenCL.)
My kernel code is like this:
__kernel void kernel_name(...)
{
    size_t d = get_local_id(0);
    char abc[8];
    ...
}
Actually, char abc[8] is useless (dead code) in my case. But if I keep char abc[8] in my kernel code, the result is totally messy and the kernel's running time is much longer (2095712 ns). If I comment out char abc[8], the result becomes correct and the running time becomes shorter (697856 ns). Shouldn't the kernel compiler strip out the dead code?
The above is just an explicit example that I can reproduce. I have also encountered stranger cases where one program gets different results when run at different times in exactly the same environment.
Is this related to memory allocation, or something else? Can anyone give me some advice on how to track down the problem?
By the way, oclDeviceQuery output information is listed as follows:
Platform Version = OpenCL 1.1
CUDA 4.2.1,
SDK Revision = 7027912
My OS is Windows XP.
Today is 2012-07-17, and I think I have resolved this problem.
Don't use #include in the kernel source file.
Don't use ultra-long lines in the kernel source file (for example, when you write a program that generates some line data for the kernel source file).
You're right, that shouldn't affect anything.
That's not your real code though, and given those run times I suspect your kernel isn't a simple thing. Possibly you're pushing your locals over some limit, which means that variables have to be stored in some slower memory, which pushes your run times up.
Something like that might also cause a change in behaviour if you had an uninitialised variable bug somewhere. In the fast store it happens to get a value that works. In the slow store it gets something else.
To check this theory I'd try to remove some other local data structure and see if it has the same effect. Anything else 8 bytes or larger should have the same effect.
...of course it's possible you've found a bug in the OpenCL implementation, but that's easy to check. Just compile the kernel for a different OpenCL device, e.g. the CPU. This is worth doing anyway, because different compilers pick up different issues.
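A minimal sketch of that check (error handling omitted; kernel_source is assumed to hold your kernel text, and it assumes your installation actually exposes a CPU device, which on an Nvidia-only machine may require installing a second OpenCL runtime such as Intel's or AMD's):
/* Build the same kernel source for a CPU device to cross-check the results. */
cl_platform_id platform;
cl_device_id   cpu;
clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &cpu, NULL);

cl_context ctx = clCreateContext(NULL, 1, &cpu, NULL, NULL, NULL);
cl_program prg = clCreateProgramWithSource(ctx, 1, &kernel_source, NULL, NULL);
clBuildProgram(prg, 1, &cpu, NULL, NULL, NULL);   /* the build log reveals compiler issues */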
Other than that I think you're back to standard debug techniques.
BTW: at one point in your question you call the array abs[8] rather than abc[8]. I assume that's a typo, but if it isn't then that could be your problem as the abs name will clash with the abs() function. That could confuse a stupid compiler.

Question about cl_mem in OpenCL

I have been using cl_mem in some of my OpenCL boilerplate code, but I have been using it from context rather than with a sharp understanding of what exactly it is. I have been using it as the type for the memory I push on and off the board, which so far has been floats. I tried looking at the OpenCL docs, but cl_mem doesn't show up (does it?). Is there any documentation on it, or is it simple enough that someone can just explain it?
The cl_mem type is a handle to a "Memory Object" (as described in Section 3.5 of the OpenCL 1.1 spec). These are essentially inputs and outputs for OpenCL kernels, and they are returned from OpenCL API calls in host code, such as clCreateBuffer:
cl_mem clCreateBuffer (cl_context context, cl_mem_flags flags,
size_t size, void *host_ptr, cl_int *errcode_ret)
The memory areas represented can be given different access permissions (e.g. read-only) or be allocated in different memory regions, depending on the flags set in the create-buffer call.
The handle is typically stored to allow a later call to release the memory, e.g:
cl_int clReleaseMemObject (cl_mem memobj)
In short, it provides an abstraction over where the memory actually is: you can copy data into the associated memory or back out via the OpenCL APIs clEnqueueWriteBuffer and clEnqueueReadBuffer, but the OpenCL implementation can allocate the space where it wants.
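A minimal sketch of that lifecycle for a buffer of floats (context, queue, N, host_in, and host_out are assumed to already exist; error handling omitted):
/* Create a device-side buffer, copy data in, read results back, release it. */
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, N * sizeof(float), NULL, NULL);

clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, N * sizeof(float), host_in,  0, NULL, NULL);
/* ... set buf as a kernel argument and enqueue the kernel here ... */
clEnqueueReadBuffer (queue, buf, CL_TRUE, 0, N * sizeof(float), host_out, 0, NULL, NULL);

clReleaseMemObject(buf);   /* drop the handle when done */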
To the computer, a cl_mem is just a number (like a file handle on Linux) that is reserved for use as a "memory identifier" (the API/driver stores information about your memory under this number: what it holds, how big it is, and so on).
