GPU Programming: 2012

Friday, 23 March 2012

Direct Lighting as Input for Single Bounce

More to come on this soon, but here are my first inputs for using RSMs (reflective shadow maps).

Culling Process and Resulting Deferred Shading

From the GBuffer data collected, a straightforward implementation of deferred shading is possible and can be done in the pixel shader. For every pixel (or sample, if using MSAA) the normals, albedo and position are read from the GBuffer. It is important to note that during the GBuffer pass the associated buffers were attached to the pipeline as render targets via the immediate context. Before the buffers can be passed in as textures to be queried (shader resource views), they must first be unbound as render targets. This can be done by passing in an array of null pointers as render targets before assigning the back buffer as the default target.

Once in the shader, the calculations are simple and similar to forward rendering. Here Phong lighting is used, with the calculations performed in view space because the buffers store all normal and position data in view space. The lights are passed in via a constant buffer or structured buffer, with their positions already transformed into view space.
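To make the per-sample lighting concrete, here is a minimal CPU-side C++ sketch of Phong shading for one GBuffer sample and one point light, all in view space. It is only an illustration, not the shader used in the project: the `Vec3` helpers, the `PointLight` layout, the linear falloff and the name `shadePhong` are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 operator*(Vec3 a, Vec3 b) { return {a.x * b.x, a.y * b.y, a.z * b.z}; }
inline Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

inline Vec3 normalize(Vec3 v)
{
    const float len = std::sqrt(dot(v, v));
    assert(len > 0.0f); // zero-length vectors cannot be normalised
    return v * (1.0f / len);
}

// Hypothetical light layout: position already transformed into view space.
struct PointLight
{
    Vec3 positionVS;
    Vec3 colour;
    float radius; // distance at which the light contributes nothing
};

// Phong shading for one GBuffer sample. In view space the camera sits at the
// origin, so the view vector is simply -position. The sample must not lie
// exactly at the origin.
Vec3 shadePhong(Vec3 positionVS, Vec3 normalVS, Vec3 albedo,
                float specularPower, const PointLight& light)
{
    const Vec3 toLight = light.positionVS - positionVS;
    const float dist = std::sqrt(dot(toLight, toLight));
    if (dist == 0.0f || dist >= light.radius)
        return {0.0f, 0.0f, 0.0f}; // out of range, nothing to add

    const Vec3 N = normalize(normalVS);
    const Vec3 L = toLight * (1.0f / dist);
    const Vec3 V = normalize(positionVS * -1.0f);
    const float nDotL = std::max(dot(N, L), 0.0f);

    // Reflect L about N and compare with the view direction.
    const Vec3 R = N * (2.0f * dot(N, L)) - L;
    const float spec = nDotL > 0.0f
        ? std::pow(std::max(dot(R, V), 0.0f), specularPower)
        : 0.0f;

    const float falloff = 1.0f - dist / light.radius; // assumed linear falloff
    const Vec3 diffuse = albedo * light.colour * nDotL;
    const Vec3 specular = light.colour * spec;
    return (diffuse + specular) * falloff;
}
```

Looping this over every light for every pixel is the cost that the tiled culling described below tries to avoid.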

4 Lights using deferred rendering

The main use for the compute shader in this portion of the rendering pass is to allow as many lights as possible in the scene, by only processing each light in the sections of the image where it is required. The first basic check can be done easily on the CPU by only passing in lights that might affect the objects in the view frustum. This eliminates lights that have no effect on any scene object. For the tests, however, all lights were placed in such a way as to affect at least part of the scene.

In the compute shader the idea of frustum culling was taken to an extreme. Thread groups were assigned to take a small square portion of the image and process the lights just for that section. This tiled approach was presented by DICE at SIGGRAPH 2011.

One purple light showing the tiles that are affected by it; the rest is filled with the normal map to show unaffected tiles

The process dispatches enough square thread groups to cover the entire image (using a ceiling function to ensure the right and bottom edges are properly handled). Inside a thread group, each thread retrieves the GBuffer data, including its z position in view space. Next, shared memory is used to gather the nearest and farthest z positions of the group; these become the near and far planes of the thread group's frustum. A projection matrix is then created for the tile from the projection matrix of the scene camera, scaled by half the size of the image divided by the size of the group. The thread group projection matrix also needs to be translated according to the group's ID, so that it is shifted correctly for the location of the tile within the picture.
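The dispatch size and the per-tile depth bounds can be sketched on the CPU as follows. This is a minimal C++ illustration rather than the actual compute shader: the 16x16 tile size, the names `tileCount` and `tileDepthBounds`, and the plain loop standing in for the shared memory min/max reduction are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

constexpr int kTileSize = 16; // assumed thread group width and height

// Ceiling division: enough tiles to cover the right and bottom edges even
// when the image size is not a multiple of the tile size.
int tileCount(int pixels)
{
    assert(pixels > 0);
    return (pixels + kTileSize - 1) / kTileSize;
}

struct DepthBounds { float nearZ; float farZ; };

// CPU stand-in for the shared memory reduction: every "thread" (pixel) of
// the tile contributes its view space z to a running min and max.
DepthBounds tileDepthBounds(const std::vector<float>& viewZ,
                            int width, int height, int tileX, int tileY)
{
    DepthBounds bounds{ std::numeric_limits<float>::max(),
                        std::numeric_limits<float>::lowest() };
    const int x0 = tileX * kTileSize;
    const int y0 = tileY * kTileSize;
    // Clamp the tile to the image so edge tiles skip out-of-range pixels.
    const int x1 = std::min(x0 + kTileSize, width);
    const int y1 = std::min(y0 + kTileSize, height);
    for (int y = y0; y < y1; ++y)
    {
        for (int x = x0; x < x1; ++x)
        {
            const float z = viewZ[y * width + x];
            bounds.nearZ = std::min(bounds.nearZ, z);
            bounds.farZ = std::max(bounds.farZ, z);
        }
    }
    return bounds;
}
```

For a 1280 x 720 target this gives an 80 x 45 dispatch. On the GPU the two loops would be replaced by every thread of the group updating a shared near and far value, followed by a group barrier.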

With a projection matrix created for the group, frustum planes can be extracted and each light can be culled against these planes. For point lights the position and attenuation are taken into account and simple sphere-plane intersection tests are performed. Rather than have each thread check all of the lights against the frustum, the work can be distributed over all of the threads in the group, since they all share the same frustum. The list of lights is divided by the number of threads in the group, so each thread only has to test numLights / (threadGroupWidth*threadGroupHeight) lights. If a light passes the frustum check, its ID is stored in group shared memory. A map approach was used to store the IDs, with keys generated by interlocked add calls so that the key changes every time a thread puts a light onto the list. Once all of the lights have been processed, the rest of the shader can function similarly to the pixel shader approach, except that each light must be looked up through the IDs stored in shared memory before it is processed. Any group that does not pass any lights is filled with black.
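The culling loop can be sketched in C++ as below. This is a CPU simulation under stated assumptions, not the project's shader: the frustum planes are taken as an input rather than extracted from the tile projection matrix, `std::atomic::fetch_add` stands in for the interlocked add on group shared memory, and the `Plane`, `PointLight` and `TileLightList` types and the 256 light cap are hypothetical.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Plane with a unit normal; a point p is inside when n.p + d >= 0.
struct Plane { float nx, ny, nz, d; };

// View space centre plus the radius beyond which attenuation reaches zero.
struct PointLight { float x, y, z, radius; };

constexpr std::uint32_t kMaxLightsPerTile = 256; // assumed capacity

// Stand-in for the group shared light list.
struct TileLightList
{
    std::atomic<std::uint32_t> count{0};
    std::array<std::uint32_t, kMaxLightsPerTile> indices{};
};

// A sphere is rejected only when it lies completely behind some plane.
bool sphereTouchesFrustum(const PointLight& light,
                          const std::array<Plane, 6>& planes)
{
    for (const Plane& p : planes)
    {
        const float dist = p.nx * light.x + p.ny * light.y + p.nz * light.z + p.d;
        if (dist < -light.radius)
            return false;
    }
    return true;
}

// Work done by one thread of the group: it tests lights threadIndex,
// threadIndex + groupSize, ... so the whole list is shared out evenly.
void cullLightsForThread(std::uint32_t threadIndex, std::uint32_t groupSize,
                         const std::vector<PointLight>& lights,
                         const std::array<Plane, 6>& planes,
                         TileLightList& out)
{
    assert(groupSize > 0 && threadIndex < groupSize);
    for (std::uint32_t i = threadIndex; i < lights.size(); i += groupSize)
    {
        if (!sphereTouchesFrustum(lights[i], planes))
            continue;
        // The atomic add hands every passing light a unique slot.
        const std::uint32_t slot = out.count.fetch_add(1);
        if (slot < kMaxLightsPerTile)
            out.indices[slot] = i; // lights past the cap are dropped
    }
}
```

After a group barrier, every thread would loop over the first `count` entries (clamped to the capacity) and shade its pixel with only those lights.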

Below are some images to show how much of an effect this culling can have. Every time the number of lights changed, the maximum brightness of a light was altered to avoid a complete washout, as 4096 lights at full brightness simply produced a white screen.

1024 lights small attenuation no fall off

4096 lights with a large attenuation and linear fall off

4096 lights with small attenuation and no fall off

4096 lights with small attenuation and linear fall off

Gathering data for deferred processing

After creating the base test with convolution matrices, the project moved on to the creation of the main artifact: complex lighting optimised through parallel processing techniques in the compute shader. The first phase was to generate a large enough amount of data to simulate performance cliffs in game situations. The data is stored in multiple structured buffers and textures (2D, 2DMS, and potentially 3D), and creates enough lighting situations to require non-trivial ALU computations.

With the render pipeline in mind, deferred shading is the only approach that would utilise the geometry and pixel functions of the hardware while allowing computations at a separate stage. The unoptimised approach renders all of the scene geometry into several full screen buffers. The outputs are the depth buffer, a buffer for the albedo values, a buffer with the 3D positions (in a given space usually world or view), and a buffer for the normals.

Albedo Buffer

Normal Buffer

The first optimisation was to not store a position buffer, as the position can be reconstructed from the depth buffer and the screen x,y coordinates (or the thread dispatch ID if in the compute shader). The trade-off for removing this buffer is a transformation from screen space x,y to homogeneous clip space and then to view or world space. The vector math is trivial for the hardware and is much cheaper than storing the position in memory and paying for the extra texture fetch. Further optimisations can be made by reducing the amount of data that must be stored and packing unused elements with lighting data. For example, the normals can be represented as two floats rather than three through spherical or stereographic mapping: since the length of each normal is always 1, the z component can be reconstructed in a similar fashion to map projections. If the normal only uses two floats, the rest of the pixel can hold the specular amounts and powers, removing the need for an extra buffer.
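Both optimisations can be sketched in a few lines of C++. This is a rough illustration under assumptions, not the project's code: it assumes a left-handed D3D style perspective projection (device depth 0 at the near plane, 1 at the far plane), picks the stereographic variant for the normals, and the `Projection` struct and function names are hypothetical. The stereographic encoding shown cannot represent the single normal (0, 0, -1).

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { float x, y; };
struct Vec3 { float x, y, z; };

// Assumed left-handed D3D style perspective projection parameters.
struct Projection
{
    float xScale; // the _11 element of the projection matrix
    float yScale; // the _22 element of the projection matrix
    float nearZ;
    float farZ;
};

// Rebuild the view space position from normalised device x,y in [-1, 1]
// and the depth buffer value in [0, 1].
Vec3 reconstructViewPosition(float ndcX, float ndcY, float deviceDepth,
                             const Projection& p)
{
    // For this projection: deviceDepth = a + b / viewZ.
    const float a = p.farZ / (p.farZ - p.nearZ);
    const float b = -p.nearZ * a;
    const float viewZ = b / (deviceDepth - a);
    // x and y were divided by w (= viewZ) after being scaled, so undo both.
    return { ndcX * viewZ / p.xScale, ndcY * viewZ / p.yScale, viewZ };
}

// Stereographic projection of a unit normal onto two floats.
Vec2 encodeNormal(Vec3 n)
{
    assert(n.z > -1.0f); // the projection pole (0, 0, -1) has no encoding
    const float inv = 1.0f / (1.0f + n.z);
    return { n.x * inv, n.y * inv };
}

// Inverse mapping: recovers all three components from the two stored floats.
Vec3 decodeNormal(Vec2 e)
{
    const float g = 2.0f / (1.0f + e.x * e.x + e.y * e.y);
    return { g * e.x, g * e.y, g - 1.0f };
}
```

In a shader the normalised device x,y would come from the pixel position or the dispatch thread ID, and the two encoded floats would share a render target with the specular terms.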

Another step that is still in progress is to store the gradient of the position z component. Along with the normal information, the change in z can help identify changes in surface from one object to another from a screen space buffer. This will be useful for edge detection algorithms, as extra sampling is usually required at the edges of objects.

delta z buffer

This buffer could probably also be optimised, as it appears the rate of change will not require the full 32 bits and could be packed to hold more data.
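One possible way to build such a delta z buffer is with simple forward differences, sketched here on the CPU in C++. The function name and the choice of storing the larger of the horizontal and vertical change are assumptions for illustration; the actual buffer may be computed differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// For each pixel, store the larger of the absolute changes in view space z
// towards the right and lower neighbours. Large values suggest a step from
// one surface to another, which is where extra sampling is usually needed.
std::vector<float> depthGradient(const std::vector<float>& viewZ,
                                 int width, int height)
{
    assert(width > 0 && height > 0);
    assert(viewZ.size() == static_cast<std::size_t>(width) * height);
    std::vector<float> gradient(viewZ.size());
    for (int y = 0; y < height; ++y)
    {
        for (int x = 0; x < width; ++x)
        {
            // Clamp at the right and bottom borders so the last row and
            // column compare against themselves (gradient of zero).
            const int xr = std::min(x + 1, width - 1);
            const int yd = std::min(y + 1, height - 1);
            const float z = viewZ[y * width + x];
            const float dx = std::fabs(viewZ[y * width + xr] - z);
            const float dy = std::fabs(viewZ[yd * width + x] - z);
            gradient[y * width + x] = std::max(dx, dy);
        }
    }
    return gradient;
}
```

Thresholding the result gives a cheap per-pixel edge flag.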

With all of the data created by the shaders dictated for each model, but sent to multiple render targets (MRT) rather than the back buffer, all calculations can now be done independently of the individual objects and their materials, and work purely in terms of screen space.

Friday, 16 March 2012

Convolution Update

A quick follow-up to the compute shader blur post. The main advantage of using the compute shader over the pixel shader for a per-pixel operation is the ability to store sampled data in shared memory for the thread group. While the texture cache of the pixel shader might be faster than shared memory for small blocks, it cannot cache large areas of an input, which is why shared memory becomes the fastest option for convolutions with larger kernel radii (usually at least 5 or 7, as seen from the output). The drawback is the context switching between dispatch and draw calls, but since convolution is done as a post process, switching (and hence stalling) can be minimised. Some of the main optimisations come from dealing with the borders of the thread groups. A basic (and this programmer's first) implementation would use N threads to output N - 2*KernelRadius pixels. The result is that neighbouring thread groups overlap by 2*KernelRadius pixels both horizontally and vertically. This redundancy limits the possible gains of the shader program and also keeps threads busy in a way that prevents them from being scheduled effectively. An alternative is to have N threads output N pixels by having every thread put one sample into shared memory and having 2*KernelRadius threads store an extra pixel (from the extremes of the thread group extents). The result is a noticeable speed increase at the cost of 2*KernelRadius extra pixels of shared memory per thread group. The next step is to test whether excessive memory grabs are cheaper than the branch statement otherwise required to populate the entire shared memory before moving on to using the collected data.
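The second scheme (N threads, N output pixels, 2*KernelRadius extra loads) can be simulated for a single row in C++ as follows. This is a sketch with assumptions: the function name `blurRowTiled` is hypothetical, image borders are clamped, and the two inner loops stand in for the load phase and the compute phase that a group barrier would separate on the GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One-dimensional blur where every thread group of groupSize threads writes
// groupSize pixels. The cache plays the role of thread group shared memory
// and holds groupSize + 2 * radius samples.
std::vector<float> blurRowTiled(const std::vector<float>& input,
                                const std::vector<float>& weights,
                                int groupSize)
{
    const int radius = static_cast<int>(weights.size()) / 2;
    const int count = static_cast<int>(input.size());
    assert(weights.size() % 2 == 1 && count > 0);
    assert(groupSize >= 2 * radius); // distinct threads fill the two aprons

    // Clamp reads at the image borders.
    auto fetch = [&](int i) { return input[std::max(0, std::min(i, count - 1))]; };

    std::vector<float> output(input.size());
    std::vector<float> cache(groupSize + 2 * radius);

    for (int groupStart = 0; groupStart < count; groupStart += groupSize)
    {
        // Load phase: every thread stores its own sample; the first and last
        // `radius` threads also store one apron sample each.
        for (int t = 0; t < groupSize; ++t)
        {
            cache[t + radius] = fetch(groupStart + t);
            if (t < radius)
                cache[t] = fetch(groupStart + t - radius);
            if (t >= groupSize - radius)
                cache[t + 2 * radius] = fetch(groupStart + t + radius);
        }
        // (A group barrier would go here on the GPU.)
        // Compute phase: each thread reads only from the cache.
        for (int t = 0; t < groupSize; ++t)
        {
            const int pixel = groupStart + t;
            if (pixel >= count)
                continue; // thread falls off the end of the row
            float sum = 0.0f;
            for (int k = 0; k <= 2 * radius; ++k)
                sum += weights[k] * cache[t + k];
            output[pixel] = sum;
        }
    }
    return output;
}
```

The two `if` statements in the load phase are the branch mentioned above; the alternative is to let every thread issue a second, possibly redundant, load.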

Thursday, 1 March 2012

Post Processing on Pixel and Compute

In the last two weeks the project has moved from pre-production into full production. The base framework is in for the render pipeline, allowing multiple render targets and dynamically setting buffers, states and shaders to the pipeline, rather than using predefined effects. This has been very helpful in the testing of post process effects where much of the geometry stages (vertex, domain, hull, and geometry shaders) remain unchanged. It also allows for run-time toggling of shaders to quickly do comparison tests. All of the base functionality is in, with the exception of model loading, which should be in within the next week or so.

The big features of this update include functionality for chaining any number of render targets from one another, as well as dynamically accessing the texture data as a render target, shader resource view or unordered access view. While structured buffers were used initially, RW texture buffers are now used as outputs from the shaders to avoid a potential CPU dump-to-texture copy for some processes. The first result of the MRT work was the baseline post process: a Gaussian blur using Pascal's triangle weights and a 7x7 kernel.
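For reference, the Pascal's triangle weights for one axis of the kernel can be generated as below. This is a small C++ sketch; the function name `binomialWeights` is an assumption, and the project may simply have hard-coded the values.

```cpp
#include <cassert>
#include <vector>

// Row (taps - 1) of Pascal's triangle, normalised so the weights sum to 1.
// For 7 taps this is 1 6 15 20 15 6 1, each divided by 64.
std::vector<float> binomialWeights(int taps)
{
    assert(taps >= 1 && taps % 2 == 1 && taps <= 31);
    const int n = taps - 1;
    const double total = static_cast<double>(1u << n); // row n sums to 2^n
    std::vector<float> weights(taps);
    double coeff = 1.0; // C(n, 0)
    for (int k = 0; k <= n; ++k)
    {
        weights[k] = static_cast<float>(coeff / total);
        coeff = coeff * (n - k) / (k + 1); // C(n, k + 1) from C(n, k)
    }
    return weights;
}
```

Because the Gaussian is separable, the same seven weights can be applied in a horizontal pass and then a vertical pass to give the full 7x7 result.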

Original frame buffer

PS blur with a 5x5 Kernel avg fps 776

Both the pixel shader and the compute shader use a separable filter, so any difference in performance comes from the compute shader's use of thread group shared memory to avoid redundant texture fetches, not from the kernel approach.

The compute shader has several different optimisations relating to the dimensions and size of the thread groups and the dispatch, in addition to the use of thread group memory. More details of the different flavours of Gaussian compute shader will follow in the next post.

Compute Shader Blur 7x7 Kernel avg fps 1243

Quick References

This post is just a quick collection of references for porting math code from DX9/DX10 to DX11.

First, the maths library from DX9/10 is no longer in use in 11. It can still be compiled into a project, but the D3DX vectors and matrices are not in the 11 library files or headers. Instead it has been replaced with DirectXMath (or, if you are still using the most recent Windows 7 SDK, XNAMath). The client-side API is the same under both names, but the new(est) DirectXMath header has more functionality according to MSDN. This is a bit confusing when setting up a D3D project, for two reasons. First, these math libraries are not included with the DX SDK download but with the Microsoft Windows 7.x (or 8) SDK, so a separate download is required. Second, MSDN's getting started page states: "Replace the "xnamath.h" header with "DirectXMath.h" and add "DirectXPackedVector.h" for the "GPU packed" types." This only works if the Metro app tools or another Windows 8 preview is installed; anyone without a preview install still needs to use xnamath.h (instead of the D3DX math header). The details can be found here:

http://msdn.microsoft.com/en-us/library/windows/desktop/ff729728%28v=vs.85%29.aspx

The main difference is that XMVECTOR is used when calling functions, while a result that is stored in a class or other container, and may need element access, is kept as an XMFLOAT4. The conversion between the two types is performed with the XMStore and XMLoad families of functions (for example XMStoreFloat4 and XMLoadFloat4).

For shader authoring, a small but important change between 10 and 11 is the transition from default row-major matrices to column-major matrices. The port is simple but worth stating: either transpose row-major matrices before sending them to the shader, or prefix the declaration of the matrix type in HLSL with row_major. Otherwise WVP matrices might give odd results.
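The transpose fix amounts to the following, shown here as a plain C++ sketch with a hypothetical `Matrix4` type rather than the XNAMath types. It assumes the CPU side keeps row-major matrices with row vectors, so a translation lives in the fourth row.

```cpp
#include <array>
#include <cassert>

using Matrix4 = std::array<std::array<float, 4>, 4>; // indexed [row][column]

// Row-vector convention: the translation sits in the fourth row.
Matrix4 translationRowVector(float x, float y, float z)
{
    Matrix4 m{};
    m[0][0] = m[1][1] = m[2][2] = m[3][3] = 1.0f;
    m[3][0] = x;
    m[3][1] = y;
    m[3][2] = z;
    return m;
}

// Swap rows and columns before uploading to a constant buffer that the
// shader reads with column-major packing.
Matrix4 transpose(const Matrix4& m)
{
    Matrix4 t{};
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            t[col][row] = m[row][col];
    return t;
}
```

With the row_major prefix on the HLSL declaration instead, the matrix can be uploaded untouched.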

Friday, 10 February 2012

Profiler Data

This week I've been busy setting up my environment, messing about with APP Profiler session data and GPU ShaderAnalyzer output, and trying out the client/server design of GPU PerfStudio2. These tools provide detailed data sheets, tons of command line options and more raw info than I know what to do with.

Offline Data from CS in VS2010 and *.hlsl files in ShaderAnalyzer

GPU PerfStudio2 was very impressive, as it profiles an application through the PerfStudio server while the application is running. It can be triggered in a similar way to setting a break point, and it captures a full frame. This is where the fun begins...

Benchmark in GPU PerfStudio2, on a single draw call at the VS stage

So far I have been able to dissect the deferred lighting benchmark through time comparisons and resource consumption. This tool allows me to step through each call and provides an easy-to-read graph (bottom of image) that visualizes the overall frame time, device call by device call. It also allows for on-the-fly shader editing, which I think will really speed up shader iterations during the production stage. These tools provide so much information, however, that I need a way to collect and manage the outputs in an automated way. Each tool provides methods for invoking the application from the command line and saving out images, CSVs and application-specific files. Hopefully I will come up with a reasonable way to make order out of all of this information in the next week or so.

As far as progress on coding goes, I wrote another sample program that utilizes the compute shader as well as a basic render path, this time based on the DirectX addition sample. The difference from the example (and from my previous post) is that I wanted to take a closer look at the interactions between the shader and the resource views, and to set up my custom build steps for effects as well as for individual shader functions. This time I did not use any of the pipeline setup from the baseline, but manually created the required buffers and used the context pointer to switch between shaders as needed. I then removed any calls that created a complete effect from the technique in the shader file, created the component shader stages individually and set them in the *.cpp. I think this will give me a lot of flexibility in the next few weeks, so that I do not have to duplicate my shaders just because they are linked in a file-specific technique. I will probably use a bit of both, but it was good to implement the idea quickly with this simple example.

Simple use of multiple SRV, UAV and structured buffers

Lastly, I have been trying out a few different optimization flavors using the FXC.exe tool. I am currently setting up a custom build step for my *.hlsl / *.fx files so that I get syntax errors at compile time rather than at runtime. At the moment I include the step depending on what type of shader it is and state the entry point in the param list. I would like to automate this further so that every time I draft a shader the correct steps are applied. Maybe I will use a different suffix for each type of shader.

Tuesday, 7 February 2012

First Compute Shader

This week I am still very much in pre-production. I am still setting up my system for profiling cpu and shader resources so I thought I would make a couple of hello world shaders and see what I need to do in order to collect the performance data I want. I am focusing on DirectCompute's impact on render techniques (even though the compute shader can be used for many other things like physics) so my first shader reads in a texture, manipulates the data and then displays the output texture.


Input Texture

The compute shader takes the input texture, multiplies it with a horizontally and vertically flipped version of the input, and then passes the resulting data out as a texture, which is then rendered to the screen as:

Output from Compute Shader

The CS ends up creating a "thread" (term used by DC, even though it is not exactly a true thread) for each pixel on the screen. The breakdown is as follows:

32, 24, 1 Thread groups (so a 2D grid 32 x 24)

20, 20, 1 Threads per group (so another 2D grid this time of 20x20)

The result is 640 x 480 total threads, and as the input image is 640 x 480 this relates to one thread per pixel
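The same breakdown can be written out as a CPU loop in C++. This is only a sketch of the indexing: it uses a single float per pixel instead of a full colour, and the function name `flipMultiply` is an assumption. The group and thread counts are the ones listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kGroupsX = 32, kGroupsY = 24;   // thread groups dispatched
constexpr int kThreadsX = 20, kThreadsY = 20; // threads per group
constexpr int kWidth = kGroupsX * kThreadsX;
constexpr int kHeight = kGroupsY * kThreadsY;
static_assert(kWidth == 640 && kHeight == 480, "one thread per pixel");

// Multiply every pixel by the pixel mirrored both horizontally and
// vertically, visiting pixels in the same group/thread pattern as the
// dispatch.
std::vector<float> flipMultiply(const std::vector<float>& input)
{
    assert(input.size() == static_cast<std::size_t>(kWidth) * kHeight);
    std::vector<float> output(input.size());
    for (int groupY = 0; groupY < kGroupsY; ++groupY)
    for (int groupX = 0; groupX < kGroupsX; ++groupX)
    for (int threadY = 0; threadY < kThreadsY; ++threadY)
    for (int threadX = 0; threadX < kThreadsX; ++threadX)
    {
        // Equivalent of the dispatch thread ID: group ID * group size + thread ID.
        const int x = groupX * kThreadsX + threadX;
        const int y = groupY * kThreadsY + threadY;
        const int flippedX = kWidth - 1 - x;
        const int flippedY = kHeight - 1 - y;
        output[y * kWidth + x] =
            input[y * kWidth + x] * input[flippedY * kWidth + flippedX];
    }
    return output;
}
```

On the GPU the four loops disappear: each thread receives its own x and y and executes only the loop body.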

While this is not exactly exciting output, it has given me an initial program that I can use with my performance analyser tools, and it has the basics of the applications I will be writing in the production phase. It has a DirectX Render context, and it uses the compute shader to produce a final texture that is rendered to the screen.

This test is also the start of my DirectX 11 framework, that will no doubt grow and improve as the project progresses, and hopefully these tests will help me determine what to include in my Renderer.

Project Overview

I've started this blog to keep track of the progress of my 4th year honours project. During the first semester I researched and proposed a project that uses GPGPU programming to improve the efficiency of common multi-pass render techniques. In January my project was approved, and now I am starting the fun part of the project: the execution phase.

Below is the abstract from my proposal that gives a good overview of what I am attempting to accomplish. In the following weeks, I will update my progress coding the project, both in pre-production of tools to help collect metrics and the production phase that will implement some of these render techniques.

ABSTRACT:

Large quality differences still exist between real-time rendering in games and offline CGI used in film. However, expectations for high quality images in games continue to rise as CG is increasingly used in everyday media. Despite improvements to GPU hardware, modern rendering approaches continue to suffer from scaling issues and performance cliffs: a non-trivial obstacle to developers’ attempts to increase fidelity in geometric, illumination, and post-process techniques.

This project proposes optimisations to modern game rendering pipelines that use programmable stages on the GPU, principally GPGPU solutions implemented with DirectCompute. The proposed research will implement data-parallel designs in a high fidelity case study. Profiling of this artifact will be used to qualify overheads, and to evaluate potential design patterns that may be applied to a wider range of multi-pass render techniques. The case study will consist of reorganising and revising a deferred shading, single bounce indirect illumination algorithm, as this lighting solution decouples geometric and lighting complexities and can be parallelised effectively. Additionally, this approach can extend not only to other illumination algorithms but to other, traditionally multi-pass, geometric and post-process techniques.