I'm trying to understand what data structure the GPU uses, but I don't really understand what address translation is. I searched Google and found nothing really helpful; the only thing I found was that it is a translation from virtual memory to physical memory, and that the code below uses integer addressing. Now I'm even more confused about integer-addressed textures (and floating-point-addressed textures?). So my question is: what is address translation from 1D to 2D?
// 1D to 2D Address Translation for Integer-Addressed Textures
float2 addrTranslation_1DtoRECT( float address1D, float2 texSize ) {
    // Parameter explanation:
    // - "address1D" is a 1D array index into an array of size N
    // - "texSize" is size of RECT texture that stores the 1D array

    // Step 1) Convert 1D address from [0,N] to [0,texSize.y]
    float CONV_CONST = 1.0 / texSize.x;
    float normAddr1D = address1D * CONV_CONST;

    // Step 2) Convert the [0,texSize.y] 1D address to 2D.
    // Performing Step 1 first allows us to efficiently compute the
    // 2D address with "frac" and "floor" instead of modulus
    // and divide. Note that the "floor" of the y-component is
    // performed by the texture system.
    float2 address2D = float2( frac(normAddr1D) * texSize.x, normAddr1D );
    return address2D;
}
(The code above is from the site I'm using to learn.)
Does it have something to do with virtual memory and physical memory?
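If I understand the code correctly, the mapping it computes is the same as this plain integer version; this is a CPU-side sketch I wrote to check my understanding, not code from the site:

// Element i of the 1D array lives at column (i mod width) and row (i / width)
// of the 2D texture. The shader above computes the same thing with frac and
// floor, which are cheaper on the GPU than modulus and divide.
void addrTranslation_1Dto2D(int address1D, int texWidth, int &x, int &y) {
    x = address1D % texWidth; // column within the row
    y = address1D / texWidth; // row index
}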
I created a custom CIKernel in Metal. This is useful because it is close to real time. I am avoiding any CGContext or CIContext that might lag in real time. My kernel essentially does a Hough transform, but I can't seem to figure out how to read the white points from the image buffer.
Here is kernel.metal:
#include <CoreImage/CoreImage.h>

extern "C" {
    namespace coreimage {

        float4 hough(sampler src) {
            // Math
            // More Math
            // eventually:
            if (luminance > 0.8) {
                float2 position = src.coord();
                // Somehow add this to an array because I need to know the x,y pair
            }
            return float4(luminance, luminance, luminance, 1.0);
        }
    }
}
I am fine if this part can be extracted to a different kernel or function. The caveat with CIKernel is that its return type is a float4 representing the new color of a pixel. Ideally, instead of an image -> image filter, I would like an image -> array sort of deal, e.g. reduce instead of map. I have a bad hunch this will require me to render the image and deal with it on the CPU.
Ultimately I want to retrieve the qualifying coordinates (of which there can be multiple per image) back in my Swift function.
FINAL SOLUTION EDIT:
As per the suggestions in the answer, I am doing the large per-pixel calculations on the GPU and some math on the CPU. I designed 2 additional kernels that work like the built-in reduction kernels. One kernel returns a 1-pixel-high image of the highest value in each column, and the other kernel returns a 1-pixel-high image of the normalized y-coordinate of the highest value:
/// Returns the maximum value in each column.
///
/// - Parameter src: a sampler for the input texture
/// - Returns: the maximum value for the column
float4 maxValueForColumn(sampler src) {
    const float2 size = float2(src.extent().z, src.extent().w);

    /// Destination pixel coordinate, normalized
    const float2 pos = src.coord();

    float maxV = 0;
    for (float y = 0; y < size.y; y++) {
        float v = src.sample(float2(pos.x, y / size.y)).x;
        if (v > maxV) {
            maxV = v;
        }
    }
    return float4(maxV, maxV, maxV, 1.0);
}
/// Returns the normalized coordinate of the maximum value in each column.
///
/// - Parameter src: a sampler for the input texture
/// - Returns: the normalized y-coordinate of the maximum value for the column
float4 maxCoordForColumn(sampler src) {
    const float2 size = float2(src.extent().z, src.extent().w);

    /// Destination pixel coordinate, normalized
    const float2 pos = src.coord();

    float maxV = 0;
    float maxY = 0;
    for (float y = 0; y < size.y; y++) {
        float v = src.sample(float2(pos.x, y / size.y)).x;
        if (v > maxV) {
            maxY = y / size.y;
            maxV = v;
        }
    }
    return float4(maxY, maxY, maxY, 1.0);
}
This won't give every pixel where luminance is greater than 0.8, but for my purposes, it returns enough: the highest value in each column, and its location.
Pro: copying only (2 * image width) bytes over to the CPU instead of every pixel saves TONS of time (a few ms).
Con: If you have two major white points in the same column, you will never know. You might have to alter this and do calculations by row instead of column if that fits your use-case.
FOLLOW UP:
There seems to be a problem in rendering the outputs. The Float values returned in Metal are not correlated to the UInt8 values I am getting in Swift.
This unanswered question describes the problem.
Edit: This answered question provides a very convenient Metal function. When you call it on a Metal value (e.g. 0.5) and return it, you will get the correct value (e.g. 128) on the CPU.
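For reference, the mapping involved is just a clamp, a scale by 255, and a round; this is a small sketch of that math in C-style code (my own illustration, not the Metal function from the linked answer):

#include <cmath>
#include <cstdint>

// Maps a normalized float in [0, 1] to an 8-bit value, e.g. 0.5 -> 128.
uint8_t normalizedFloatToByte(float v) {
    float clamped = std::fmin(std::fmax(v, 0.0f), 1.0f);
    return static_cast<uint8_t>(std::lround(clamped * 255.0f));
}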
Check out the filters in the CICategoryReduction category (like CIAreaAverage). They return images that are just a few pixels tall, containing the reduction result. But you still have to render them to be able to read the values in your Swift function.
The problem with using this approach for your problem is that you don't know the number of coordinates you are returning beforehand. Core Image needs to know the extent of the output when it calls your kernel, though. You could just assume a static maximum number of coordinates, but that all sounds tedious.
I think you are better off using Accelerate APIs for iterating the pixels of your image (parallelized, super efficiently) on the CPU to find the corresponding coordinates.
You could do a hybrid approach where you do the per-pixel heavy math on the GPU with Core Image and then do the analysis on the CPU using Accelerate. You can even integrate the CPU part into your Core Image pipeline using a CIImageProcessorKernel.
The source of my confusion is the documentation on SCNMatrix4 from Apple:
SceneKit uses matrices to represent coordinate space transformations,
which in turn can represent the combined position, rotation or
orientation, and scale of an object in three-dimensional space.
SceneKit matrix structures are in row-major order, so they are
suitable for passing to shader programs or OpenGL APIs that accept matrix parameters.
This seems contradictory since OpenGL uses column-major transformation matrices!
In general, Apple's documentation on this topic is poor:
GLKMatrix4 does not talk about column-major vs. row-major order at all, but I am certain from all the discussions online that GLKMatrix4 is plain and simple column-major.
simd_float4x4 does not talk about the topic either, but at least the variable names make it clear that it is stored as columns:
From simd/float.h:
typedef struct { simd_float4 columns[4]; } simd_float4x4;
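A quick way to see the column-major storage in practice is to build a translation by writing the fourth column directly (a small sketch of my own using the simd API, not taken from the headers):

#include <simd/simd.h>

// Translation by (tx, ty, tz): the translation ends up in the fourth *column*.
simd_float4x4 makeTranslation(float tx, float ty, float tz) {
    simd_float4x4 m = matrix_identity_float4x4;
    m.columns[3] = simd_make_float4(tx, ty, tz, 1.0f);
    return m;
}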
Some discussions online about SCNMatrix4 talk as if SCNMatrix4 were different from GLKMatrix4, but looking at the code for SCNMatrix4 gives some hints that it might just be column-major as expected:
/* Returns a transform that translates by '(tx, ty, tz)':
 * m' = [1 0 0 0; 0 1 0 0; 0 0 1 0; tx ty tz 1]. */
NS_INLINE SCNMatrix4 SCNMatrix4MakeTranslation(float tx, float ty, float tz) {
    SCNMatrix4 m = SCNMatrix4Identity;
    m.m41 = tx;
    m.m42 = ty;
    m.m43 = tz;
    return m;
}
So is SCNMatrix4 row-major (storing translations in m14, m24, m34) or column-major (storing translations in m41, m42, m43)?
SCNMatrix4 stores the translation in m41, m42, and m43, just like GLKMatrix4. This little playground confirms it, as does its definition.
import SceneKit
let translation = SCNMatrix4MakeTranslation(1, 2, 3)
translation.m41
// 1
translation.m42
// 2
translation.m43
// 3
I have no idea why this is wrong in the documentation though, probably just a mistake.
I think the execution time of my kernel is too high. Its job is just to blend two images together using either addition, subtraction, division, or multiplication.
#define SETUP_KERNEL(name, operator)\
__kernel void name(__read_only image2d_t image1,\
                   __read_only image2d_t image2,\
                   __write_only image2d_t output,\
                   const sampler_t sampler1,\
                   const sampler_t sampler2,\
                   float opacity)\
{\
    int2 xy = (int2) (get_global_id(0), get_global_id(1));\
    float2 normalizedCoords = convert_float2(xy) / (float2) (get_image_width(output), get_image_height(output));\
    float4 pixel1 = read_imagef(image1, sampler1, normalizedCoords);\
    float4 pixel2 = read_imagef(image2, sampler2, normalizedCoords);\
    write_imagef(output, xy, (pixel1 * opacity) operator pixel2);\
}
SETUP_KERNEL(div, /)
SETUP_KERNEL(add, +)
SETUP_KERNEL(mult, *)
SETUP_KERNEL(sub, -)
As you can see, I use macros to quickly define the different kernels. (Would it be better to use functions for that?)
Somehow the kernel manages to take 3 ms on a GTX 970.
What can I do to increase the performance of this particular kernel?
Should I split it into different programs?
Bilinear interpolation is 2x-3x slower than nearest-neighbor sampling. Are you sure you are not using nearest neighbor in OpenGL?
What it does in the background (via the sampler) is something like:
R1 = ((x2 - x) / (x2 - x1)) * Q11 + ((x - x1) / (x2 - x1)) * Q21
R2 = ((x2 - x) / (x2 - x1)) * Q12 + ((x - x1) / (x2 - x1)) * Q22
After the two R values are calculated, the value of P can finally be calculated as a weighted average of R1 and R2:
P = ((y2 - y) / (y2 - y1)) * R1 + ((y - y1) / (y2 - y1)) * R2
The calculation has to be repeated for the red, green, blue, and optionally the alpha component of the pixel.
http://supercomputingblog.com/graphics/coding-bilinear-interpolation/
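To make the formula concrete, here is a plain C-style sketch of one bilinear fetch for a single channel (my own illustration of the math above, not code from that page):

// Bilinear interpolation of one channel at (x, y), where Q11, Q21, Q12, Q22 are
// the four neighboring texel values at (x1, y1), (x2, y1), (x1, y2), (x2, y2).
float bilinear(float x, float y,
               float x1, float y1, float x2, float y2,
               float Q11, float Q21, float Q12, float Q22) {
    float r1 = ((x2 - x) / (x2 - x1)) * Q11 + ((x - x1) / (x2 - x1)) * Q21;
    float r2 = ((x2 - x) / (x2 - x1)) * Q12 + ((x - x1) / (x2 - x1)) * Q22;
    return ((y2 - y) / (y2 - y1)) * r1 + ((y - y1) / (y2 - y1)) * r2;
}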
Or it may simply be that Nvidia implemented the fast path for OpenGL image access and the complete path for OpenCL image access. For example, on AMD, image writes are the complete path, data accesses smaller than 32 bits are the complete path, and image reads are the fast path.
Another possibility: a Z-order layout is better suited to the compute divergence of this image data, and OpenCL's non-Z-order layout (I'm suspicious about this, maybe not) is worse.
Division is often expensive, so I suggest moving the calculation of normalizedCoords to the host side.
On the host side:
float normalized_x[output_width];  // initialized with [0..output_width-1]/output_width
float normalized_y[output_height]; // initialized with [0..output_height-1]/output_height
for (int i = 0; i < output_width;  ++i) normalized_x[i] = i / (float)output_width;
for (int j = 0; j < output_height; ++j) normalized_y[j] = j / (float)output_height;
Change the kernel to:
#define SETUP_KERNEL(name, operator)\
__kernel void name(__read_only image2d_t image1,\
                   __read_only image2d_t image2,\
                   __write_only image2d_t output,\
                   global float *normalized_x, \
                   global float *normalized_y, \
                   const sampler_t sampler1,\
                   const sampler_t sampler2,\
                   float opacity)\
{\
    int2 xy = (int2) (get_global_id(0), get_global_id(1));\
    float2 normalizedCoords = (float2) (normalized_x[xy.x], normalized_y[xy.y]);\
    float4 pixel1 = read_imagef(image1, sampler1, normalizedCoords);\
    float4 pixel2 = read_imagef(image2, sampler2, normalizedCoords);\
    write_imagef(output, xy, (pixel1 * opacity) operator pixel2);\
}
Also, you can try not using normalized coordinates at all by applying the same technique.
This would be more beneficial if the sizes of the input images don't change often.
I want to read a 3D model from an FBX file and display it in an OpenGL ES 2.0 engine on iPhone (not using Unity), and I also want to display the animations for the loaded 3D object.
How can I get the animations from the FBX file?
Currently I am able to get a list of pose names which, as I understand it, come with transformation matrices, the full list of poses in the layer, stacks, layers, and a lot of curves.
How can all of this information be combined to display proper animations?
I also tried to parse some info from the TakeInfo, but the result looks a little strange to me. For example:
FbxTakeInfo *ltakeInfo = pScene->GetTakeInfo(lAnimStack->GetName());
FbxTime start = ltakeInfo->mLocalTimeSpan.GetStart();
FbxTime end = ltakeInfo->mLocalTimeSpan.GetStop();
self.startTime = start.GetSecondDouble();
self.endTime = end.GetSecondDouble();
Here I got start = 0 and end = 0.014 for each of the parsed layers, so I guess this is incorrect (the FBX file that I want to display contains 1 mesh with a simple 5-second animation).
Update
After a few hours of investigating, I found the following.
For reference, this is the structure of the test object that I want to display:
Here you can see a lot of bones (19, to be exact).
I am able to get (as noted above) a list of 19 animated objects (the names of the bones/objects) and 19 groups of curves with 151 frames each (at a frame rate of 30 that is exactly 5 seconds of animation: 30 * 5 = 150, plus 1 final identity matrix).
If I apply each curve group to my mesh one by one (I am able to parse only one mesh), I see the animation of a different part of the mesh applied to the whole mesh (for example a vertical rotation or a horizontal translation), so I think the curves in each group should be applied to one specific bone, and as a result I would get the animation for my mesh. The problem is that I don't know how to animate only the vertices belonging to the selected bone.
Now the problem is: how do I apply all these animations, split into separate per-bone groups, to the whole object (since I have only one mesh)?
How can I get one global list of curves per frame from the list with all the curve groups?
Update 2
Thanks to @codetiger's advice, I followed the instructions in the link provided in the comment, and with that technique I am able to retrieve a list of bone-specific matrices with start and end times and the required transforms. This is almost the same as what I got before with the curves; the only difference is that with curves I had to build the matrix from 9 curves (translate/scale/rotate for x, y, z) instead of using a full matrix. But the problem is still present: how can I combine them into one global matrix?
The code that I use (found through a few links):
FbxNode* modelNode = _fbxScene->GetRootNode();
FbxAMatrix geometryTransform = GetGeometryTransformation(modelNode);

unsigned int numOfDeformers = mesh->GetDeformerCount(FbxDeformer::eSkin); // number of skin deformers on the mesh
for (unsigned int deformerIndex = 0; deformerIndex < numOfDeformers; ++deformerIndex) {
    FbxSkin* currSkin = reinterpret_cast<FbxSkin*>(mesh->GetDeformer(deformerIndex, FbxDeformer::eSkin));
    if (!currSkin) {
        continue;
    }

    unsigned int numOfClusters = currSkin->GetClusterCount();
    for (unsigned int clusterIndex = 0; clusterIndex < numOfClusters; ++clusterIndex) {
        FbxCluster* currCluster = currSkin->GetCluster(clusterIndex);
        std::string currJointName = currCluster->GetLink()->GetName();

        FbxAMatrix transformMatrix;
        FbxAMatrix transformLinkMatrix;
        FbxAMatrix globalBindposeInverseMatrix;
        currCluster->GetTransformMatrix(transformMatrix);
        currCluster->GetTransformLinkMatrix(transformLinkMatrix);
        globalBindposeInverseMatrix = transformLinkMatrix.Inverse() * transformMatrix * geometryTransform;

        FbxAnimStack* currAnimStack = _fbxScene->GetSrcObject<FbxAnimStack>(0);
        FbxString animStackName = currAnimStack->GetName();
        char *mAnimationName = animStackName.Buffer();
        FbxTakeInfo* takeInfo = _fbxScene->GetTakeInfo(animStackName);
        FbxTime start = takeInfo->mLocalTimeSpan.GetStart();
        FbxTime end = takeInfo->mLocalTimeSpan.GetStop();
        FbxLongLong mAnimationLength = end.GetFrameCount(FbxTime::eFrames24) - start.GetFrameCount(FbxTime::eFrames24) + 1;

        for (FbxLongLong i = start.GetFrameCount(FbxTime::eFrames24); i <= end.GetFrameCount(FbxTime::eFrames24); ++i) {
            FbxTime currTime;
            currTime.SetFrame(i, FbxTime::eFrames24);
            FbxAMatrix currentTransformOffset = modelNode->EvaluateGlobalTransform(currTime) * geometryTransform;
            FbxAMatrix mat = currentTransformOffset.Inverse() * currCluster->GetLink()->EvaluateGlobalTransform(currTime);
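            // A common next step (sketch only; "keyframes" is an illustrative container, not FBX SDK API):
            // combine "mat" with the cluster's inverse bind pose to get the skinning matrix for this
            // joint at this frame, and keep one matrix per joint per frame so the whole set can later
            // be uploaded as shader uniforms, e.g.
            //   FbxAMatrix boneTransform = mat * globalBindposeInverseMatrix;
            //   keyframes[clusterIndex].push_back(boneTransform);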
        }
    }
}
Here I receive 121 matrices instead of 151 (the code samples at FbxTime::eFrames24, so 5 seconds gives 24 * 5 + 1 = 121 frames); the duration of some matrix transforms is longer than the duration of one frame draw, so I think the quantity here is also correct.
float duration = end.GetSecondDouble() - start.GetSecondDouble(); // returns 5, as expected
I guess that in the Autodesk SDK there is a way to get one global transform per draw call.
Any suggestions? Is this possible?
Added a bounty for anyone who can describe how to display animations on iPhone with OpenGL ES 2.0 and the Autodesk SDK... (sorry for the typo: Autodesk, not Facebook)
Here is what I'm able to get.
The original object in Blender:
If I draw the bones separately (same VBO with a different transform and only the indices for each bone):
Here is how to render an animated mesh in OpenGL ES. This will give you a fair idea of what details you need to read from the file.
Method 1: (GPU Skinning)
This method works only with a limited number of bones, based on the capabilities of the hardware. Usually the limit depends on the number of matrices you can send to the shader.
Bind the mesh information to the GPU using BindAttributes once and send the matrices to the shader as uniforms.
Step 1: Read all bone matrices, create an array of matrices, and send this data to the shader as uniforms.
Step 2: In the vertex shader, calculate gl_Position as in the shader below:
attribute vec3 vertPosition;
attribute vec3 vertNormal;
attribute vec2 texCoord1;
attribute vec3 vertTangent;
attribute vec3 vertBitangent;
attribute vec4 optionalData1;
attribute vec4 optionalData2;

uniform mat4 mvp, jointTransforms[jointsSize];

void main()
{
    mat4 finalMatrix;

    int jointId = int(optionalData1.x);
    if (jointId > 0)
        finalMatrix = jointTransforms[jointId - 1] * optionalData2.x;

    jointId = int(optionalData1.y);
    if (jointId > 0)
        finalMatrix = finalMatrix + (jointTransforms[jointId - 1] * optionalData2.y);

    jointId = int(optionalData1.z);
    if (jointId > 0)
        finalMatrix = finalMatrix + (jointTransforms[jointId - 1] * optionalData2.z);

    jointId = int(optionalData1.w);
    if (jointId > 0)
        finalMatrix = finalMatrix + (jointTransforms[jointId - 1] * optionalData2.w);

    gl_Position = mvp * finalMatrix * vec4(vertPosition, 1.0);
}
In this shader, I send the weight-paint information in the attributes optionalData1 & optionalData2. The data is packed like this: (BoneId1, BoneId2, BoneId3, BoneId4) and (Weight4Bone1, Weight4Bone2, Weight4Bone3, Weight4Bone4). So in this case you are limited to a maximum of 4 bones affecting each vertex. The fragment shader part is the usual one.
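To make the attribute packing concrete, the per-vertex data implied by that shader could be laid out roughly like this (the struct name and exact layout are my own illustration, not part of the answer):

// Illustrative CPU-side vertex layout matching the shader attributes above.
struct SkinnedVertex {
    float position[3];
    float normal[3];
    float texCoord[2];
    float tangent[3];
    float bitangent[3];
    float boneIds[4];     // optionalData1: up to 4 bone indices (0 means "no bone")
    float boneWeights[4]; // optionalData2: the matching weights, expected to sum to 1
};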
Method 2: (CPU Skinning)
If you cannot live with the limitations of GPU skinning, then the only way is to do the calculations on the CPU side.
Step 1: Calculate the vertex positions for the current frame by multiplying each vertex by the matrices of the bones that affect it.
Step 2: Collect the new positions for the current frame and send them to the GPU as vertex attributes.
Step 3: Recalculate the new vertex positions every frame and update the attribute buffer.
This method moves the load of the calculations to the CPU.
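For what it's worth, here is a minimal sketch of what Step 1 could look like (my own illustration; the Vec3 and Matrix4 types, the column-major layout, and the 4-influences-per-vertex packing are assumptions mirroring the shader above):

#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Column-major 4x4 matrix; multiplyPoint applies it to a point with w = 1.
struct Matrix4 {
    float m[16]; // m[col * 4 + row]
    Vec3 multiplyPoint(const Vec3& p) const {
        return { m[0] * p.x + m[4] * p.y + m[8]  * p.z + m[12],
                 m[1] * p.x + m[5] * p.y + m[9]  * p.z + m[13],
                 m[2] * p.x + m[6] * p.y + m[10] * p.z + m[14] };
    }
};

// Step 1 on the CPU: blend each bind-pose vertex by the matrices of the bones
// that affect it, weighted as in the shader above (up to 4 bones per vertex).
void skinOnCpu(const std::vector<Vec3>& bindPositions,
               const std::vector<int>& boneIds,       // 4 ids per vertex, 0 = unused
               const std::vector<float>& boneWeights, // 4 weights per vertex
               const std::vector<Matrix4>& jointTransforms,
               std::vector<Vec3>& outPositions) {
    for (std::size_t v = 0; v < bindPositions.size(); ++v) {
        Vec3 result = {0.0f, 0.0f, 0.0f};
        for (int k = 0; k < 4; ++k) {
            int joint = boneIds[v * 4 + k];
            float w = boneWeights[v * 4 + k];
            if (joint <= 0 || w <= 0.0f) continue;
            Vec3 p = jointTransforms[joint - 1].multiplyPoint(bindPositions[v]);
            result.x += w * p.x;
            result.y += w * p.y;
            result.z += w * p.z;
        }
        // Steps 2 and 3: upload outPositions as the position attribute every frame.
        outPositions[v] = result;
    }
}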
I'm trying to render a simple triangle on screen using Direct3D 11, but nothing shows up. Here are my vertices:
SimpleVertex vertices[3] =
{
    { XMFLOAT3( -1.0f, -1.0f, 0.0f ) },
    { XMFLOAT3(  1.0f, -1.0f, 0.0f ) },
    { XMFLOAT3( -1.0f,  1.0f, 0.0f ) },
};
The expected output is a triangle with one point in the bottom left corner of the screen, one point in the bottom right corner, and one point in the top left corner. However, nothing is being rendered anywhere.
I'm not performing any matrix transformations, and the vertex shader just passes the input directly to the output. Everything seems to be set up correctly, and when I use the graphics debugger in Visual Studio 2012, the correct vertex positions are being passed to the vertex shader. However, it skips directly from the vertex shader stage to the output merger stage in the pipeline. I assume this means that nothing is being sent to the pixel shader, which would in turn mean that the vertices are being discarded in the rasterizer stage. Why is this happening?
Here is my rasterizer state:
D3D11_RASTERIZER_DESC rasterizerDesc;
rasterizerDesc.AntialiasedLineEnable = false;
rasterizerDesc.CullMode = D3D11_CULL_NONE;
rasterizerDesc.DepthBias = 0;
rasterizerDesc.DepthBiasClamp = 0.0f;
rasterizerDesc.DepthClipEnable = true;
rasterizerDesc.FillMode = D3D11_FILL_SOLID;
rasterizerDesc.FrontCounterClockwise = false;
rasterizerDesc.MultisampleEnable = false;
rasterizerDesc.ScissorEnable = false;
rasterizerDesc.SlopeScaledDepthBias = 0.0f;
And my viewport (width/height are the window client area matching my back buffer, which are set to 1024x576 in my test setup):
D3D11_VIEWPORT viewport;
viewport.Height = static_cast< float >( height );
viewport.MaxDepth = 1.0f;
viewport.MinDepth = 0.0f;
viewport.TopLeftX = 0.0f;
viewport.TopLeftY = 0.0f;
viewport.Width = static_cast< float >( width );
Can anyone see what is making the rasterizer stage drop my vertices? Or are there any other parts of my D3D setup that could be causing this?
I found this on the internet. It took absolutely ages to load, so I copied and pasted it; I have highlighted an interesting point in bold.
The D3D_OVERLOADS constructors defined in row 11 offer a convenient way for C++ programmers to create transformed and lit vertices with D3DTLVERTEX.
_D3DTLVERTEX(const D3DVECTOR& v, float _rhw, D3DCOLOR _color,
             D3DCOLOR _specular, float _tu, float _tv)
{
    sx = v.x;
    sy = v.y;
    sz = v.z;
    rhw = _rhw;
    color = _color;
    specular = _specular;
    tu = _tu;
    tv = _tv;
}
The system requires a vertex position that has already been transformed. So the x and y values must be in screen coordinates, and z must be the depth value of the pixel, which could be used in a z-buffer (we won't use a z-buffer here). Z values can range from 0.0 to 1.0, where 0.0 is the closest possible position to the viewer, and 1.0 is the farthest position still visible within the viewing area. Immediately following the position, transformed and lit vertices must include an RHW (reciprocal of homogeneous W) value.
Before rasterizing the vertices, they have to be converted from homogeneous vertices to non-homogeneous vertices, because the rasterizer expects them this way. Direct3D converts the homogeneous vertices to non-homogeneous vertices by dividing the x-, y-, and z-coordinates by the w-coordinate, and produces an RHW value by inverting the w-coordinate. This is only done for vertices which are transformed and lit by Direct3D.
The RHW value is used in multiple ways: for calculating fog, for performing perspective-correct texture mapping, and for w-buffering (an alternate form of depth buffering).
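To make that concrete, this is roughly the arithmetic involved in producing the screen-space values and the RHW that a D3DTLVERTEX carries (a sketch of the math only; the NDC-to-pixel viewport mapping is left out):

// Converts a homogeneous clip-space position (x, y, z, w) to non-homogeneous
// form and computes the reciprocal of the homogeneous W.
void toScreenSpace(float x, float y, float z, float w,
                   float& outX, float& outY, float& outZ, float& outRhw) {
    outRhw = 1.0f / w;  // RHW: reciprocal of homogeneous W
    outX = x * outRhw;  // divide x, y, z by w ...
    outY = y * outRhw;
    outZ = z * outRhw;  // ... z ends up as the [0,1] depth usable by a z-buffer
}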
With D3D_OVERLOADS defined, D3DVECTOR is declared as
_D3DVECTOR(D3DVALUE _x, D3DVALUE _y, D3DVALUE _z);
D3DVALUE is the fundamental Direct3D fractional data type. It's declared in d3dtypes.h as
typedef float D3DVALUE, *LPD3DVALUE;
The source shows that the x and y values for the D3DVECTOR are always 0.0f (this will be changed in InitDeviceObjects()). rhw is always 0.5f, color is 0xffffffff, and specular is set to 0. Only the tu1 and tv1 values differ between the four vertices. These are the coordinates of the background texture.
In order to map texels onto primitives, Direct3D requires a uniform address range for all texels in all textures. Therefore, it uses a generic addressing scheme in which all texel addresses are in the range of 0.0 to 1.0 inclusive.
If, instead, you decide to assign texture coordinates to make Direct3D use the bottom half of the texture, the texture coordinates your application would assign to the vertices of the primitive in this example are (0.0,0.0), (1.0,0.0), (1.0,0.5), and (0.0,0.5). Direct3D will apply the bottom half of the texture as the background.
Note: By assigning texture coordinates outside that range, you can create certain special texturing effects.
You will find the declaration of D3DTextr_CreateTextureFromFile() in the Framework source in d3dtextr.cpp. It creates a local bitmap from a passed file. Textures can be created from *.bmp and *.tga files. Textures are managed in the framework in a linked list that holds the per-texture info, called a texture container.
struct TextureContainer
{
    TextureContainer* m_pNext; // Linked list ptr

    TCHAR m_strName[80];       // Name of texture (doubles as image filename)
    DWORD m_dwWidth;
    DWORD m_dwHeight;
    DWORD m_dwStage;           // Texture stage (for multitexture devices)
    DWORD m_dwBPP;
    DWORD m_dwFlags;
    BOOL  m_bHasAlpha;

    LPDIRECTDRAWSURFACE7 m_pddsSurface; // Surface of the texture
    HBITMAP m_hbmBitmap;                // Bitmap containing texture image
    DWORD* m_pRGBAData;

public:
    HRESULT LoadImageData();
    HRESULT LoadBitmapFile( TCHAR* strPathname );
    HRESULT LoadTargaFile( TCHAR* strPathname );
    HRESULT Restore( LPDIRECT3DDEVICE7 pd3dDevice );
    HRESULT CopyBitmapToSurface();
    HRESULT CopyRGBADataToSurface();

    TextureContainer( TCHAR* strName, DWORD dwStage, DWORD dwFlags );
    ~TextureContainer();
};
The problem was actually in my rendering logic. I set the stride of the vertex buffer to 0 instead of the size of my vertex struct. Changed that, and it renders just fine!
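For anyone who hits the same thing, the fix amounts to passing the real vertex size as the stride when binding the buffer; a minimal sketch (the buffer and context variable names are illustrative, not my exact code):

// The stride must be the size of one vertex, not 0, or the input assembler
// will not step through the buffer correctly.
UINT stride = sizeof(SimpleVertex);
UINT offset = 0;
deviceContext->IASetVertexBuffers(0, 1, &vertexBuffer, &stride, &offset);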