I'm doing GPGPU on WebGL, and I don't know whether the access pattern I describe here also applies to general graphics and gaming programs. In our code we frequently come across data which needs to be summarized or reduced per output texel. A very simple example is matrix multiplication, during which, for every output texel, you return a value which is the dot product of a row of one input and a column of the other input.
This has been the sore point of our performance, not so much because of the computation but because of the repeated data access. So I've been trying to find a pattern of reads or a data layout which would speed up this operation, and I have been completely unsuccessful.
I will describe some assumptions and some schemes below. The sample code for all of these is under https://github.com/jeffsaremi/webgl-experiments
Unfortunately due to size I wasn't able to use the 'snippet' feature of StackOverflow. NOTE: All examples write to console not the html page.
Base matmul implementation: Example: [2,3]x[3,4]->[2,4]. In its simplest form this produces 2 textures of (w:3,h:2) and (w:4,h:3). For each output texel I read along the X axis of the left texture but along the Y axis of the right texture. (see webgl-matmul.html)
Assuming that the GPU accesses data much like a CPU does -- that is, block by block -- if I read along the width of the texture I should be hitting the cache pretty often.
For this, I lay out both textures in a way that lets me do dot products of corresponding rows (along the texture width) only. Example: [2,3]x[4,3]->[2,4]. Note that the data for the right texture is now transposed, so that for each output texel I do a dot product of one row from the left and one row from the right. (see webgl-matmul-shared-alongX.html)
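For reference, the rows-only scheme can be sketched on the CPU like this. It is only an illustration in C++, not the shader code from the repository; the function name matmulRowsOnly and the flat row-major arrays are assumptions made for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C = A * B where the right operand is supplied already transposed.
// a:  M x K, row-major.
// bT: N x K, row-major (the transpose of the K x N matrix B).
// Both inner-loop reads walk along a row, i.e. along the texture's X axis.
std::vector<float> matmulRowsOnly(const std::vector<float>& a,
                                  const std::vector<float>& bT,
                                  std::size_t M, std::size_t K, std::size_t N) {
    assert(a.size() == M * K);
    assert(bT.size() == N * K);
    std::vector<float> c(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                sum += a[i * K + k] * bT[j * K + k];
            }
            c[i * N + j] = sum;
        }
    }
    return c;
}
```

With M=2, K=3, N=4 this corresponds to the [2,3]x[4,3]->[2,4] example.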
To ensure that the above assumption is indeed working, I created a negative test also. In this test I'd be reading along the Y axis of both left and right textures which should have the worst performance ever. Data is pre-transposed so that the results make sense. Example: [3,2]x[3,4]->[2,4]. (see webgl-matmul-shared-alongY.html).
So I ran these -- and I hope you can run them as well to see for yourself -- and I found no evidence for or against the existence of such caching behavior. You need to run each example a few times to get consistent results for comparison.
Then I came across this paper http://fileadmin.cs.lth.se/cs/Personal/Michael_Doggett/pubs/doggett12-tc.pdf which, in short, claims that the GPU caches data in blocks (or tiles, as I call them).
Based on this promising lead I created a version of matmul (or dot product) which uses blocks of 2x2 to do its calculation. Before using it, of course, I had to rearrange my inputs into such a layout. The cost of that rearrangement is not included in my comparison; let's say I could do it once and run my matmul many times afterwards. Even this scheme did not improve performance; if anything, it made it slightly worse. (see webgl-dotprod-tiled.html).
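To make the rearrangement concrete, here is a minimal C++ sketch of one possible 2x2 tiled layout. It is an assumption for illustration and not necessarily the layout used in webgl-dotprod-tiled.html; the name toTiles2x2 is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rearranges a row-major (height x width) array so that each 2x2 block is
// stored as four consecutive values. Blocks are ordered left to right,
// top to bottom. Assumes width and height are even.
std::vector<float> toTiles2x2(const std::vector<float>& src,
                              std::size_t width, std::size_t height) {
    assert(width % 2 == 0 && height % 2 == 0);
    assert(src.size() == width * height);
    std::vector<float> dst(src.size());
    std::size_t out = 0;
    for (std::size_t ty = 0; ty < height; ty += 2) {
        for (std::size_t tx = 0; tx < width; tx += 2) {
            for (std::size_t dy = 0; dy < 2; ++dy) {
                for (std::size_t dx = 0; dx < 2; ++dx) {
                    dst[out++] = src[(ty + dy) * width + (tx + dx)];
                }
            }
        }
    }
    return dst;
}
```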
At this point I am completely out of ideas, and any hints would be appreciated.
thanks
I'm just starting out with procedural generation and I've made a program that generates lines using a D0L-system by following Paul Bourke's website. For the first two simple examples it works great, but when I input the rules of the L-System Leaf, my results are incorrect as can be seen on this image.
Could any of you more experienced people point out where I might be going wrong? I'm pretty sure that I'm misunderstanding something about the usage of the length factor. In my case, lengthFactor is a static float that is set once before the generation starts and is used to multiply/divide the line's length in the current drawing state. lengthFactor itself won't change during the generation.
I'm using OpenGL for rendering and programming in C++.
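For readers comparing notes, here is a minimal C++ sketch of how the length factor is commonly handled. It assumes the symbol meanings from Bourke's description: '>' multiplies the current line length by the length factor, '<' divides it, and '[' / ']' save and restore the drawing state, which here includes the current length. The function names are hypothetical, and whether this matches the bug in the question is not known.

```cpp
#include <cassert>
#include <map>
#include <stack>
#include <string>
#include <vector>

// One D0L rewriting step: every symbol is replaced in parallel.
// Symbols without a rule are copied unchanged.
std::string rewriteOnce(const std::string& input,
                        const std::map<char, std::string>& rules) {
    std::string output;
    for (char symbol : input) {
        auto rule = rules.find(symbol);
        if (rule != rules.end()) output += rule->second;
        else output += symbol;
    }
    return output;
}

// Returns the length used for each 'F' segment, in drawing order.
// Assumed meanings: '>' multiplies the current length by lengthFactor,
// '<' divides it, '[' saves the drawing state and ']' restores it.
std::vector<float> segmentLengths(const std::string& commands,
                                  float startLength, float lengthFactor) {
    assert(lengthFactor != 0.0f);
    std::vector<float> lengths;
    std::stack<float> saved;
    float current = startLength;
    for (char symbol : commands) {
        switch (symbol) {
            case 'F': lengths.push_back(current); break;
            case '>': current *= lengthFactor; break;
            case '<': current /= lengthFactor; break;
            case '[': saved.push(current); break;
            case ']':
                if (!saved.empty()) { current = saved.top(); saved.pop(); }
                break;
            default: break;  // turns and placeholder symbols do not affect length
        }
    }
    return lengths;
}
```

Note that lengthFactor stays constant throughout; only the current length changes, and it travels with the drawing state.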
I'm trying to implement a complex algorithm on the GPU. The only problem is HW limitations: the maximum available feature level is 9_3.
The algorithm is basically a "stereo matching"-like algorithm for two images. Because of the mentioned limitations, all calculations have to be performed in Vertex/Pixel shaders only (there is no compute API available). Vertex shaders are rather useless here, so I treat them as pass-through vertex shaders.
Let me shortly describe the algorithm:
1. Take two images and calculate cost volume maps (basically: convert RGB to grayscale, translate the right image by D, and subtract it from the left image). This step is repeated around 20 times for different values of D, which generates a Texture3D.
Problem here: I cannot simply create one Pixel Shader which calculates those 20 repetitions in one go because of the size limitation of a Pixel Shader (max. 512 arithmetic instructions), so I'm forced to call Draw() in a loop in C++, which unnecessarily involves the CPU while all operations are done on the same two images. It seems to me like I have one bottleneck here. I know that there are multiple render targets, but there are max. 8 targets (I need 20+), and if I want to generate 8 results in one pixel shader I exceed its size limit (512 arithmetic instructions for my HW).
2. Then, for each of the calculated textures, I need to calculate a box filter with a window where r > 9.
Another problem here: Because the window is so big, I need to split the box filtering into two Pixel Shaders (vertical and horizontal direction separately), because the loop unrolling stage results in very long code. Writing those loops out manually won't help because it would still create too big a pixel shader. So there is another bottleneck here: the CPU needs to be involved to pass results from the temp texture (result of the V pass) to the second pass (H pass).
3. Then, in the next step, some arithmetic operations are applied to each pair of results from the 1st step and the 2nd step.
I haven't reached this point in my development yet, so I have no idea what kind of bottlenecks are waiting for me here.
4. Then the minimal D (the value of the parameter from the 1st step) is taken for each pixel, based on the pixel value from step 3.
... same as in step 3.
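As a plain CPU reference for steps 1, 2 and 4 above, the data flow can be sketched in C++ as follows. This is a minimal illustration under assumptions, not the shader implementation: the names Image, costSlice, boxPass and argMinDisparity are hypothetical, borders are clamped, the cost is taken as an absolute difference, and step 3 is left out, so the minimum is taken directly over whatever slices are passed in.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Image {
    int width;
    int height;
    std::vector<float> pixels;  // grayscale, row-major
    float at(int x, int y) const {
        // Clamp coordinates to the border, like a clamped texture fetch.
        if (x < 0) x = 0;
        if (x >= width) x = width - 1;
        if (y < 0) y = 0;
        if (y >= height) y = height - 1;
        return pixels[static_cast<std::size_t>(y) * width + x];
    }
};

// Step 1: cost slice for one disparity d: |left(x, y) - right(x - d, y)|.
Image costSlice(const Image& left, const Image& right, int d) {
    Image out{left.width, left.height, std::vector<float>(left.pixels.size())};
    for (int y = 0; y < left.height; ++y)
        for (int x = 0; x < left.width; ++x)
            out.pixels[static_cast<std::size_t>(y) * left.width + x] =
                std::fabs(left.at(x, y) - right.at(x - d, y));
    return out;
}

// Step 2: one 1D pass of a box filter with radius r. Calling it with
// (dx, dy) = (1, 0) and then (0, 1) gives the full separable box filter.
Image boxPass(const Image& in, int r, int dx, int dy) {
    Image out{in.width, in.height, std::vector<float>(in.pixels.size())};
    for (int y = 0; y < in.height; ++y)
        for (int x = 0; x < in.width; ++x) {
            float sum = 0.0f;
            for (int k = -r; k <= r; ++k) sum += in.at(x + k * dx, y + k * dy);
            out.pixels[static_cast<std::size_t>(y) * in.width + x] = sum / (2 * r + 1);
        }
    return out;
}

// Step 4: for each pixel, the index of the slice with the smallest cost.
std::vector<int> argMinDisparity(const std::vector<Image>& slices) {
    assert(!slices.empty());
    std::vector<int> best(slices[0].pixels.size(), 0);
    for (std::size_t p = 0; p < best.size(); ++p)
        for (std::size_t d = 1; d < slices.size(); ++d)
            if (slices[d].pixels[p] < slices[best[p]].pixels[p])
                best[p] = static_cast<int>(d);
    return best;
}
```

Such a reference is mainly useful for checking the output of the individual GPU passes against known values.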
Here is a VERY simple graph showing my current implementation (excluding steps 3 and 4).
Red dots/circles/whatever are temporary buffers (textures) where partial results are stored, and at every red dot the CPU gets involved.
Question 1: Isn't it possible somehow to let the GPU know how to perform each branch from top to bottom without involving the CPU and creating a bottleneck? I.e. to program a sequence of graphics pipelines in one go and then let the GPU do its job.
One additional question about render-to-texture: Do all textures reside in GPU memory all the time, even between Draw() calls and Pixel/Vertex shader switches? Or is there any transfer from GPU to CPU happening? This may be another issue that leads to a bottleneck.
Any help would be appreciated!
Thank you in advance.
Best regards,
Lukasz
Writing computational algorithms in pixel shaders can be very difficult. Writing such algorithms for the 9_3 target can be impossible: there are too many restrictions. But, well, I think I know how to work around your problems.
1. Shader repetition
First of all, it is unclear what you call a "bottleneck" here. Yes, theoretically, draw calls in a for loop are a performance loss. But is it a bottleneck? Does your application really lose performance here? How much? Only profilers (CPU and GPU) can answer that. But to run them, you must first complete your algorithm (stages 3 and 4). So I'd stick with the current solution, implement the whole algorithm, then profile, and then fix performance issues.
But if you feel ready to tweak... The common "repetition" technique is instancing. You can create one more vertex buffer (called an instance buffer), which contains parameters not for each vertex but for each draw instance. Then you do all the work with one DrawInstanced() call.
For your first stage, the instance buffer can contain your D value and the index of the target Texture3D layer. You can pass them through from the vertex shader.
As always, you have a tradeoff here: simplicity of code versus (probably) performance.
2. Multi-pass rendering
CPU needs to be involved to pass results from temp texture (result of
V pass) to the second pass (H pass)
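Typically you chain passes like this, so the CPU never touches the intermediate results:
// Pass 1: from pTexture0 to pTexture1
// ...set up pipeline state for Pass 1 here...
// pSRV0 is a shader resource view of pTexture0, pRTV1 a render target view of pTexture1
pContext->OMSetRenderTargets(1, &pRTV1, nullptr); // target
pContext->PSSetShaderResources(slot, 1, &pSRV0); // source
pContext->Draw(...);
// Pass 2: from pTexture1 to pTexture2
// ...set up pipeline state for Pass 2 here...
// Bind the new target first: a texture that is still bound as a render target
// cannot be bound as a shader resource (the runtime binds NULL instead).
pContext->OMSetRenderTargets(1, &pRTV2, nullptr);
pContext->PSSetShaderResources(slot, 1, &pSRV1); // previous target is now source
pContext->Draw(...);
// Pass 3: ...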
Note that pTexture1 must be created with both the D3D11_BIND_SHADER_RESOURCE and D3D11_BIND_RENDER_TARGET flags. You can have multiple input textures and multiple render targets. Just make sure that every pass knows what the previous pass outputs.
And if the previous pass uses more resources than the current one, don't forget to unbind the ones you no longer need, to prevent hard-to-find errors:
ID3D11ShaderResourceView* nullSRVs[3] = { nullptr, nullptr, nullptr };
pContext->PSSetShaderResources(2, 3, nullSRVs); // unbind slots 2, 3 and 4
// Only texture slots 0 and 1 will be used
3. Resource data location
Does all textures resides in GPU memory all the time even between
Draw() method calls and Pixel/Vertex shaders switching?
We can never know that for certain. The driver chooses an appropriate location for resources. But if your resources are created with D3D11_USAGE_DEFAULT and with CPUAccessFlags set to 0, you can be almost sure they will always stay in video memory.
Hope it helps. Happy coding!
I want to get the properly rendered projection result from a Stage3D framework that presents something of a 'gray box' interface via its API. It is gray rather than black because I can see this critical snippet of source code:
matrix3D.copyFrom (renderable.getRenderSceneTransform (camera));
matrix3D.append (viewProjection);
The projection rendering technique that perfectly suits my needs comes from a helpful tutorial that works directly with AGAL rather than any particular framework. Its comparable rendering logic snippet looks like this:
cube.mat.copyToMatrix3D (drawMatrix);
drawMatrix.prepend (worldToClip);
So, I believe the correct, general summary of what is going on here is that both pieces of code are setting up the proper combined matrix to be sent to the Vertex Shader where that matrix will be a parameter to the m44 AGAL operation. The general description is that the combined matrix will take us from Object Local Space through Camera View Space to Screen or Clipping Space.
My problem can be summarized as arising from my ignorance of proper matrix operations. I believe my failed attempt to merge the two environments arises precisely because the semantics of prepending one matrix to another is not, and is never intended to be, equivalent to appending that matrix to the other. My request, then, can be summarized in this way. Because I have no control over the calling sequence that the framework will issue, e.g., I must live with an append operation, I can only try to fix things on the side where I prepare the matrix which is to be appended. That code is not black-boxed, but it is too complex for me to know how to change it so that it would meet the interface requirements posed by the framework.
Is there some sequence of inversions, transformations or other maneuvers which would let me modify a viewProjection matrix that was designed to be prepended, so that it will turn out right when it is, instead, appended to the Object's World Space coordinates?
I am providing an answer more out of desperation than sure understanding, and still hope I will receive a better answer from those more knowledgeable. From Dunn and Parberry's "3D Math Primer" I learned that "transposing the product of two matrices is the same as taking the product of their transposes in reverse order."
Without being able to understand how to enter text involving superscripts, I am not sure if I can reduce my approach to a helpful mathematical formulation, so I will invent a syntax using functional notation. The equivalency noted by Dunn and Parberry would be something like:
transpose (AB) = transpose (B) x transpose (A)
That comes close to solving my problem, which, to restate, arises only because I cannot control the behavior of the internal matrix operations in the framework package. I can, however, perform appropriate matrix operations on either side of the workflow from local object coordinates to those required by the GPU Vertex Shader.
I have not completed the test of my solution, which requires the final step to be taken in the AGAL shader, but I have been able to confirm in AS3 that the last 'un-transform' does yield exactly the same combined raw data as the example from the author of the camera with the desired lens properties whose implementation involves prepending rather than appending.
BA = transpose (transpose (A) x transpose (B))
I have also not yet tested to see if these extra calculations are so processing intensive as to reduce my application frame rate beyond what is acceptable, but am pleased at least to be able to confirm that the computations yield the same result.
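The identity itself is easy to check numerically. Below is a small C++ sketch (not AS3, and using 2x2 integer matrices purely for illustration; Mat2, multiply, transpose and checkReversal are names invented here) which confirms that transposing the product of the transposes yields the product in the opposite order.

```cpp
#include <array>
#include <cassert>

using Mat2 = std::array<std::array<int, 2>, 2>;

// Plain matrix product a * b.
Mat2 multiply(const Mat2& a, const Mat2& b) {
    Mat2 c{};  // zero-initialized
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            for (int k = 0; k < 2; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

Mat2 transpose(const Mat2& m) {
    Mat2 t{};
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            t[j][i] = m[i][j];
    return t;
}

// Asserts that BA == transpose(transpose(A) x transpose(B)).
void checkReversal(const Mat2& a, const Mat2& b) {
    assert(multiply(b, a) == transpose(multiply(transpose(a), transpose(b))));
}
```

The same reasoning carries over to 4x4 matrices; only the loop bounds change.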
I've recently taken the plunge into DirectX and have been messing around a little with Anim8or, and have discovered several file types that models can be exported to that are text based. I've particularly taken to VTX files. I've learned how to parse some basics out of it, but I'm obviously missing a few things.
It starts with a .Faceset which is immediately (on the same line) followed by the number of meshes in the file.
For each mesh, there is one .Vertex section and one .Index section in that order and the first pair of .Vertex/.Index sections are the first mesh, the second set are the second mesh and so on as you'd expect.
In a .Vertex section of the file, there's 8 numbers per line and an undefined number of lines (unless you want to trust the comments Anim8or has put just before the section, but that doesn't seem to be part of the specs of the file, just Anim8or being kind). The first 3 numbers correspond to X, Y, and Z coordinates for a particular point that'll later be used as a vertex, the other 5 I have no idea. A majority of the time, the last 2 numbers are both 0, but I've noticed that's not ALWAYS true, just usually true.
Next comes the matching .Index section. Each line in this section has 4 numbers. The first 3 are references to the vertices previously stated, and the 3 points mark a triangle in the model: 0 means the first vertex mentioned, 1 means the next one, and so on, like a zero-based array. The 4th number appears to always be -1; I can't figure out what importance it has, and I can't promise it's ALWAYS -1. In case you can't tell, I'm not too certain about anything in this file type.
There's also other information in the file that I'm choosing to ignore right now because I'm new and don't want to overcomplicate things too much. Such as after every .Index section is:
.Brdf
// Ambient color
0.431 0.431 0.431
// Diffuse color
0.431 0.431 0.431
// Specular color and exponent
1 1 1 2
// Kspecular = 0.5
// end of .Brdf
It appears to me this is about the surface of the mesh just described. But it's not needed for placement of meshes so I moved past it for now.
Moving on to the real problem... I can load a VTX file when there's only one mesh in it (meaning the .FaceSet is 1). I can almost successfully load a VTX file that has multiple meshes: each mesh is structured correctly, but it is not properly placed in relation to the other meshes. I downloaded an AT-AT model from an Anim8or thread in a forum, and it's made up of 344 meshes. When I load the file using just the specs I've mentioned so far, the AT-AT looks exploded, as if it were an assembly diagram (when loaded in Anim8or, all pieces are close together and resemble a fully assembled AT-AT). All the pieces are oriented correctly and have the same up direction, but there's plenty of extra space between the pieces.
Does somebody know how to properly read a VTX file? Or know of a website that'll explain what those other numbers mean?
Edit:
The file extension .VTX is used for a lot of different things and has a lot of different structures depending on what the expected use is. Valve, Visio, Anim8or, and several others use VTX, I'm only interested in the VTX file that Anim8or exports and the structure that it uses.
I have been working on a 3D Modeling program myself and wanted a simple format to be able to bring objects in to the editor to be able to test the speed of my drawing routines with large sets of vertices and faces. I was looking for an easy one where I could get models quickly and found the .vtx format. I googled it and found your question. When I was unable to find the format on the internet, I played around and compared .OBJ exports with .vtx ones. (Maybe it was created just for Anim8or?) Here is what I found:
1) Yes, the vertices have eight numbers on each line. The first three are, as you guessed, the x, y, and z coordinates. The next three are the vertex normals, nx, ny, and nz. You may notice that each vertex appears multiple times with different normals for each face that contains it. The last two numbers are texture coordinates.
2) As for the faces, I reached the same conclusions as you did. The first three numbers are indices into the vertex list above. The last number does appear to always be -1. I am going to assume that it has something to do with the facing of the face. (e.g. facing in or out.) Since most models are created with the faces all facing appropriately, it stands to reason that this would be the same number for all of them.
3) One additional note: When comparing the .obj with the .vtx, I did notice that the positions of the vertices changed. This was also true when comparing with the .an8 file. This should not be a "HUGE" problem as long as they are all offset by the same amount in each vertex and every file. At least then it could be compensated for.
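The line layout described in points 1) and 2) can be parsed along these lines. This is a minimal C++ sketch based only on the observations above, not on an official specification; the struct and function names are hypothetical, and the fourth index value is kept as-is because its meaning is unknown.

```cpp
#include <cassert>
#include <sstream>
#include <string>

struct VtxVertex {
    float x, y, z;     // position
    float nx, ny, nz;  // vertex normal
    float u, v;        // texture coordinates
};

struct VtxTriangle {
    int a, b, c;  // zero-based indices into the mesh's .Vertex list
    int extra;    // fourth number, observed as -1; meaning unknown
};

// Parses one line of a .Vertex section. Returns false if the line does
// not start with eight numbers (for example a comment or section marker).
bool parseVertexLine(const std::string& line, VtxVertex& out) {
    std::istringstream in(line);
    return static_cast<bool>(in >> out.x >> out.y >> out.z
                                >> out.nx >> out.ny >> out.nz
                                >> out.u >> out.v);
}

// Parses one line of an .Index section (three indices plus the extra value).
bool parseIndexLine(const std::string& line, VtxTriangle& out) {
    std::istringstream in(line);
    return static_cast<bool>(in >> out.a >> out.b >> out.c >> out.extra);
}
```

Because the indices are local to each mesh, a loader would keep a separate vertex list per .Vertex/.Index pair.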
Have you considered using the .obj file format? It is text-based and is not extremely difficult to parse or understand. There is quite a bit of information about it online.
I am going to add that, after a few hours inspection, the vtx export in Anim8or seems to be broken. I experienced the same problem as you did that the pieces were not located properly. My assumption would be that anim8or exports these objects using the local coordinates for each mesh and not accounting for transformations that have been applied. I do also note that it will not IMPORT the vtx file...
Based on some googling, it seems you're at the wrong end of the pipeline. As I understand it: A VTX file is a Valve Proprietary File Format that is the result of a set of steps.
The final output of Studiomdl for each Half-Life model is a group of files in the gamedirectory/models folder ready to be used by the Game Engine:
- an .MDL file which defines the structure of the model along with animation, bounding box, hit box, material, mesh and LOD information,
- a .VVD file which stores position independent flat data for the bone weights, normals, vertices, tangents and texture coordinates used by the MDL,
- currently three separate types of VTX file: .sw.vtx (Software), .dx80.vtx (DirectX 8.0) and .dx90.vtx (DirectX 9.0), which store hardware optimized material, skinning and triangle strip/fan information for each LOD of each mesh in the MDL,
- often a .PHY file containing a rigid or jointed (ragdoll) collision model, and
- sometimes a .ANI file for To do: something to do with model animations
Valve
Now the Valve Source SDK may have some utilities in it to read VTX files (it seems to have the ability to make them, anyway). Some people may have made 3rd party tools or have code to read them, but it is likely not to work on all files, simply because it is a 3rd party format. I also found this post which might help if you haven't seen it before.
I am developing a game for the web. The map of this game will be a minimum of 2000km by 2000km. I want to be able to encode elevation and terrain type at some level of granularity - 100m X 100m for example.
For a 2000km by 2000km map, storing this information in 100m x 100m buckets would mean 20000 by 20000 elements, or a total of 400,000,000 records in a database.
Is there some other way of storing this type of information?
MORE INFORMATION
The map itself will not ever be displayed in its entirety. Units will be moved on the map in a turn based fashion and the players will get feedback on where they are located and what the local area looks like. Terrain will dictate speed and prohibition of movement.
I guess I am trying to say that the map will be used for the game logic and not necessarily for graphical or display purposes.
It depends on how you want to generate your terrain.
For example, you could procedurally generate it all (using interpolation of a low resolution terrain/height map - stored as two "bitmaps" - with random interpolation seeded from the xy coords to ensure that terrain didn't morph), and use minimal storage.
If you wanted areas of terrain that were completely defined, you could store these separately and use them where appropriate, randomly generating the rest.
If you want completely defined terrain, then you're going to need to look into some kind of compression/streaming technique to only pull terrain you are currently interested in.
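The procedural idea can be sketched as follows: a minimal C++ illustration that bilinearly interpolates a coarse height map and adds a perturbation derived only from the cell coordinates and a seed, so a cell always gets the same elevation. The function names, the hash constants and the roughness parameter are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Deterministic pseudo-random value in [0, 1) derived only from (x, y) and
// a world seed. The mixing constants are arbitrary illustrative choices.
float cellNoise(std::uint32_t x, std::uint32_t y, std::uint32_t seed) {
    std::uint32_t h = seed;
    h ^= x * 0x9E3779B1u;
    h = (h << 13) | (h >> 19);
    h ^= y * 0x85EBCA6Bu;
    h *= 0xC2B2AE35u;
    h ^= h >> 16;
    return (h >> 8) * (1.0f / 16777216.0f);  // top 24 bits -> [0, 1)
}

// Elevation of a fine cell: bilinear interpolation of a coarse height map
// plus a small deterministic perturbation.
// coarse: row-major, coarseW x coarseH samples, one sample every `step` cells.
// Assumes x and y are non-negative and inside the interpolable area.
float elevationAt(const std::vector<float>& coarse, int coarseW, int coarseH,
                  int step, int x, int y, float roughness, std::uint32_t seed) {
    int cx = x / step, cy = y / step;
    assert(cx + 1 < coarseW && cy + 1 < coarseH);
    float fx = static_cast<float>(x % step) / step;
    float fy = static_cast<float>(y % step) / step;
    float h00 = coarse[cy * coarseW + cx];
    float h10 = coarse[cy * coarseW + cx + 1];
    float h01 = coarse[(cy + 1) * coarseW + cx];
    float h11 = coarse[(cy + 1) * coarseW + cx + 1];
    float top = h00 + (h10 - h00) * fx;
    float bottom = h01 + (h11 - h01) * fx;
    float base = top + (bottom - top) * fy;
    float noise = cellNoise(static_cast<std::uint32_t>(x),
                            static_cast<std::uint32_t>(y), seed);
    return base + roughness * (noise - 0.5f);
}
```

Only the coarse map and the seed need to be stored; every fine cell is recomputed on demand.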
I would treat it differently, by separating terrain type and elevation.
Terrain type, I assume, does not change as rapidly as elevation - there are probably sectors of the same type of terrain that stretch over much longer than the lowest level of granularity. I would map those sectors into database records or some kind of hash table, depending on performance, memory and other requirements.
Elevation, I would assume, is semi-continuous, as it changes gradually for the most part. I would try to map the values into a set of continuous functions (with different sets for parts that are not continuous, as in a sudden change in elevation). For any set of coordinates for which the terrain has the same elevation or can be described by a simple function, you just need to define the range this function covers. This should greatly reduce the amount of information you need to record to describe the elevation at each point in the terrain.
So basically I would break down the map into different sectors which compose of (x,y) ranges, once for terrain type and once for terrain elevation, and build a hash table for each which can return the appropriate value as needed.
If you want the kind of granularity that you are looking for, then there is no obvious way of doing it.
You could try a 2-dimensional wavelet transform, but that's pretty complex. Something like a Fourier transform would do quite nicely. Plus, you probably wouldn't go about storing the terrain with a one-record-per-piece-of-land way; it makes more sense to have some sort of database field which can store an encoded matrix.
I think the usual solution is to break your domain up into "tiles" of manageable sizes. You'll have to add a little bit of logic to load the appropriate tiles at any given time, but not too bad.
You shouldn't need to access all that info at once--even if each 100m x 100m bucket occupied a single pixel on the screen, no screen I know of could show 20k x 20k pixels at once.
Also, I wouldn't use a database--look into height mapping--effectively using a black & white image whose pixel values represent heights.
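A minimal C++ sketch of the tile addressing might look like this. The tile size of 250 cells and the per-cell storage (a 16-bit elevation plus an 8-bit terrain type) are assumptions chosen for illustration; the names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int kMapCells = 20000;  // 2000 km / 100 m
constexpr int kTileCells = 250;   // hypothetical tile edge, in cells
constexpr int kTilesPerSide = kMapCells / kTileCells;  // 80

struct CellAddress {
    int tileIndex;     // which tile to load
    int offsetInTile;  // index of the cell inside that tile's arrays
};

// Maps a global cell coordinate to a tile and an offset inside that tile.
CellAddress locateCell(int x, int y) {
    assert(x >= 0 && x < kMapCells && y >= 0 && y < kMapCells);
    int tileX = x / kTileCells, tileY = y / kTileCells;
    int localX = x % kTileCells, localY = y % kTileCells;
    return {tileY * kTilesPerSide + tileX, localY * kTileCells + localX};
}

// One tile of the map: a small height map plus a terrain-type map.
struct Tile {
    std::vector<std::uint16_t> elevation;   // one value per cell
    std::vector<std::uint8_t> terrainType;  // small enum stored as a byte
    Tile() : elevation(kTileCells * kTileCells),
             terrainType(kTileCells * kTileCells) {}
};
```

Under these assumptions a tile holds 250 x 250 x 3 = 187,500 bytes, and the whole map is 6,400 tiles, or about 1.2 GB uncompressed, of which only the tiles around active units need to be in memory.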
Good luck!
That will be an awful lot of information no matter which way you look at it. 400,000,000 grid cells will take their toll.
I see two ways of going about this. Firstly, since it is a web-based game, you might be able to get a server with a decently sized HDD and store the 400M records on it just as you would normally. Or, more likely, create some sort of storage mechanism of your own for efficiency. Then you would only have to devise a way to access the data efficiently, which could be done by taking into account the fact that you will hardly ever need to use it all at once. ;)
The other way would be some kind of compression. You have to be careful with this though. Most out-of-the-box compression algorithms won't allow you to decompress an arbitrary location in the stream. Perhaps your terrain data has some patterns in it you can use? I doubt it will be completely random. More likely I predict large areas with the same data. Perhaps those can be encoded as such?
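If large areas do share the same terrain type, one simple option is run-length encoding each row separately, so that a lookup only scans one row's runs instead of decompressing a whole stream. The following C++ sketch illustrates that idea; the names are hypothetical, and it is not the only way to encode such patterns.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// (terrain type, run length) pairs for one row of cells.
using Runs = std::vector<std::pair<std::uint8_t, std::uint32_t>>;

// Collapses consecutive equal terrain values into runs.
Runs encodeRow(const std::vector<std::uint8_t>& row) {
    Runs runs;
    for (std::uint8_t value : row) {
        if (!runs.empty() && runs.back().first == value) ++runs.back().second;
        else runs.emplace_back(value, 1u);
    }
    return runs;
}

// Looks up the cell at column x without decoding the whole row.
std::uint8_t terrainAt(const Runs& runs, std::uint32_t x) {
    for (const auto& run : runs) {
        if (x < run.second) return run.first;
        x -= run.second;
    }
    assert(false && "x is past the end of the row");
    return 0;
}
```

How much this saves depends entirely on how uniform the terrain actually is.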