GPU Programming: February 2012

Friday, 10 February 2012

Profiler Data

This week I've been busy setting up my environment and messing about with APP Profiler session data, GPU ShaderAnalyzer output and the client/server design of GPU PerfStudio2. These tools provide detailed data sheets, tons of command line options and more raw info than I know what to do with.

Offline Data from CS in VS2010 and *.hlsl files in ShaderAnalyzer

GPU PerfStudio2 was very impressive: it profiles an application while it is running, through the PerfStudio server, and a full-frame capture can be triggered in much the same way as setting a breakpoint. This is where the fun begins...

Benchmark in GPU PerfStudio2, on a single draw call at the VS stage

So far I have been able to dissect the deferred lighting benchmark through time comparisons and resource consumption. This tool allows me to step through each call and provides an easy-to-read graph (bottom of image) that visualizes the overall frame time, device call by device call. It also allows for on-the-fly shader editing, which I think will really speed up shader iterations during the production stage. These tools provide so much information, however, that I need a way to collect and manage the outputs in an automated way. Each tool provides methods for invoking the application from the command line and saving out images, CSVs and application-specific files. Hopefully I will come up with a reasonable way to make order out of all of this information in the next week or so.

As far as progress on coding goes, I wrote another sample program that uses the compute shader as well as a basic render path, this time based on the DirectX addition sample. The difference from the example (and from my previous post) is that I wanted to take a closer look at the interactions between the shader and the resource views, and to set up my custom build steps for effects and for individual shader functions. This time I did not use any of the pipeline setup from the baseline but instead manually created the required buffers and used the context pointer to switch between shaders as needed. I then removed any calls that created a complete effect from the technique in the shader file, created the component shader stages individually and set them in the *.cpp. I think this will give me a lot of flexibility in the next few weeks, as I will not have to duplicate my shaders just because they are linked to a file-specific technique. I will probably use a bit of both, but it was good to implement the idea quickly with this simple example.

Simple use of multiple SRV, UAV and structured buffers

Lastly, I have been trying out a few different optimization flavors using the FXC.exe tool. I am currently setting up a custom build step for my *.hlsl / *.fx files so that I will get syntax errors at compile time rather than at runtime. Currently, I am including the step depending on what type of shader it is and stating the entry point in the param list. I would like to automate this more so that every time I draft a shader it will apply the correct steps. Maybe I will use a different suffix for each type of shader.
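The suffix idea could look something like the sketch below. It is only an illustration: the suffixes (`_vs.hlsl`, `_ps.hlsl`, `_cs.hlsl`), the function names and the `.cso` output name are made up for this example, not something the build actually uses yet. The `/T`, `/E`, `/Fo` and `/nologo` switches and the `*_5_0` profiles are standard FXC options.

```cpp
#include <cassert>
#include <string>

// Hypothetical convention: the file suffix selects the FXC target profile.
// Returns an empty string when the suffix is not recognised.
std::string profileForFile(const std::string& fileName) {
    auto endsWith = [&](const std::string& suffix) {
        return fileName.size() >= suffix.size() &&
               fileName.compare(fileName.size() - suffix.size(),
                                suffix.size(), suffix) == 0;
    };
    if (endsWith("_vs.hlsl")) return "vs_5_0";  // vertex shader
    if (endsWith("_ps.hlsl")) return "ps_5_0";  // pixel shader
    if (endsWith("_cs.hlsl")) return "cs_5_0";  // compute shader
    if (endsWith(".fx"))      return "fx_5_0";  // complete effect file
    return "";
}

// Builds the command line a custom build step could run for one file.
// The output file name (<input>.cso) is an arbitrary choice for this sketch.
std::string fxcCommand(const std::string& fileName,
                       const std::string& entryPoint) {
    const std::string profile = profileForFile(fileName);
    assert(!profile.empty() && "unrecognised shader suffix");

    std::string cmd = "fxc.exe /nologo /T " + profile;
    if (profile != "fx_5_0") {
        cmd += " /E " + entryPoint;  // effects carry their own techniques
    }
    cmd += " /Fo \"" + fileName + ".cso\" \"" + fileName + "\"";
    return cmd;
}
```

With this convention, `fxcCommand("lighting_cs.hlsl", "CSMain")` would produce a `cs_5_0` compile with `CSMain` as the entry point, so the only per-file decision left is the name of the file.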

Tuesday, 7 February 2012

First Compute Shader

This week I am still very much in pre-production. I am still setting up my system for profiling CPU and shader resources, so I thought I would make a couple of hello world shaders and see what I need to do in order to collect the performance data I want. I am focusing on DirectCompute's impact on render techniques (even though the compute shader can be used for many other things, like physics), so my first shader reads in a texture, manipulates the data and then displays the output texture.


Input Texture

The compute shader takes the input texture, multiplies it with a horizontally and vertically flipped version of the input, and then passes the resulting data out as a texture, which is then rendered to the screen as:

Output from Compute Shader
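For reference, the same operation can be written as a plain CPU loop, which is handy for checking the shader output. This is a minimal sketch and not the shader itself: it assumes a single-channel, row-major float image and that "flipped" means each pixel is paired with the one mirrored in both x and y. The function name is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for the compute shader's per-pixel work (assumed layout:
// one float per pixel, row-major). Each output pixel is the input pixel
// multiplied by the pixel at the horizontally and vertically mirrored position.
std::vector<float> multiplyWithFlipped(const std::vector<float>& input,
                                       std::size_t width,
                                       std::size_t height) {
    assert(input.size() == width * height);

    std::vector<float> output(input.size());
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const std::size_t flippedX = width - 1 - x;
            const std::size_t flippedY = height - 1 - y;
            output[y * width + x] =
                input[y * width + x] * input[flippedY * width + flippedX];
        }
    }
    return output;
}
```

In the shader, the body of the two loops is what a single thread does for its own pixel; the loops themselves are replaced by the dispatch.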

The CS ends up creating a "thread" (term used by DC, even though it is not exactly a true thread) for each pixel on the screen. The breakdown is as follows:

32, 24, 1 Thread groups (so a 2D grid 32 x 24)

20, 20, 1 Threads per group (so another 2D grid this time of 20x20)

The result is 640 x 480 total threads (32 x 20 = 640 across, 24 x 20 = 480 down), and as the input image is 640 x 480 this works out to one thread per pixel.
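The arithmetic behind that breakdown can be sketched in a few lines. The helper names below are made up for this example; only the numbers (640 x 480 image, 20 x 20 threads per group, 32 x 24 groups) come from the program. The last function mirrors how HLSL derives SV_DispatchThreadID from the group ID and the thread ID within the group.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative type: sizes of a 3D grid of groups or threads.
struct GridSize {
    std::uint32_t x, y, z;
};

// Groups needed along one axis, rounding up so a partial group still
// covers the last few pixels when the size is not an exact multiple.
std::uint32_t groupsNeeded(std::uint32_t pixels, std::uint32_t threadsPerGroup) {
    assert(threadsPerGroup > 0);
    return (pixels + threadsPerGroup - 1) / threadsPerGroup;
}

// Thread group counts to dispatch for a width x height image.
GridSize dispatchSize(std::uint32_t width, std::uint32_t height,
                      GridSize threadsPerGroup) {
    return { groupsNeeded(width, threadsPerGroup.x),
             groupsNeeded(height, threadsPerGroup.y),
             1 };
}

// Per-axis global thread index: groupId * threadsPerGroup + groupThreadId.
std::uint32_t dispatchThreadId(std::uint32_t groupId,
                               std::uint32_t threadsPerGroup,
                               std::uint32_t groupThreadId) {
    return groupId * threadsPerGroup + groupThreadId;
}
```

For the 640 x 480 image with 20 x 20 threads per group, `dispatchSize` gives 32 x 24 x 1 groups, and the last thread of the last group lands on pixel (639, 479). If the image size were not a multiple of 20, the rounding up would launch a few extra threads, and the shader would need a bounds check.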

While this is not exactly exciting output, it has given me an initial program that I can use with my performance analyser tools, and it has the basics of the applications I will be writing in the production phase. It has a DirectX Render context, and it uses the compute shader to produce a final texture that is rendered to the screen.

This test is also the start of my DirectX 11 framework, which will no doubt grow and improve as the project progresses, and hopefully these tests will help me determine what to include in my Renderer.

Project Overview

I've started this blog to keep track of the progress of my 4th year honours project. During the first semester I researched and proposed a project that uses GPGPU programming to improve the efficiency of common multi-pass render techniques. In January, my project was approved and now I am starting the fun part of the project: the execution phase.

Below is the abstract from my proposal that gives a good overview of what I am attempting to accomplish. In the following weeks, I will update my progress coding the project, both in pre-production of tools to help collect metrics and the production phase that will implement some of these render techniques.

ABSTRACT:

Large quality differences still exist between real-time rendering in games and offline CGI used in film. However, expectations for high quality images in games continue to rise as CG is increasingly used in everyday media. Despite improvements to GPU hardware, modern rendering approaches continue to suffer from scaling issues and performance cliffs: a non-trivial obstacle to developers’ attempts to increase fidelity in geometric, illumination, and post-process techniques.

This project proposes optimisations to modern game rendering pipelines that use programmable stages on the GPU, principally GPGPU solutions implemented with DirectCompute. The proposed research will implement data-parallel designs in a high fidelity case study. Profiling of this artifact will be used to qualify overheads, and to evaluate potential design patterns that may be applied to a wider range of multi-pass render techniques. The case study will consist of reorganising and revising a deferred shading, single bounce indirect illumination algorithm, as this lighting solution decouples geometric and lighting complexities and can be parallelised effectively. Additionally, this approach can extend not only to other illumination algorithms but to other, traditionally multi-pass, geometric and post-process techniques.