GPU Programming: Culling Process and Resulting Deferred Shading

From the GBuffer data collected, a straightforward implementation of deferred shading is possible and can be done in the pixel shader. For every pixel (or sample, if using MSAA), the normals, albedo and position are read from the GBuffer. It is important to note that during the GBuffer pass, the associated buffers were attached to the pipeline as render targets via the immediate context. Before the buffers can be passed in as textures to be queried (shader resource views), they must first be unbound as render targets. This can be done by passing in an array of null pointers as render targets before assigning the back buffer as the default target.

Once in the shader, the calculations are simple and similar to forward rendering. Here Phong lighting is used, with the calculations performed in view space because the buffers store all normal and position data in view space. The lights are passed in via a constant buffer or structured buffer, with their positions already transformed into view space.
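The per-sample lighting described above can be sketched on the CPU as follows. This is a minimal illustration, not the shader used for the images: the `Vec3` and `PointLight` types, the `shadePhong` name, the linear distance falloff and the white specular colour are all assumptions made for the example. The camera sits at the view-space origin, so the view vector is simply the negated position.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3 operator*(Vec3 a, Vec3 b) { return {a.x * b.x, a.y * b.y, a.z * b.z}; }
Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

Vec3 normalize(Vec3 v) {
    float len = std::sqrt(dot(v, v));
    return len > 0.0f ? v * (1.0f / len) : v;
}

// Hypothetical light layout: position already transformed into view space.
struct PointLight {
    Vec3 positionView;
    Vec3 color;
    float range;  // distance at which the light stops contributing
};

// Phong shading of one GBuffer sample, entirely in view space.
Vec3 shadePhong(Vec3 positionView, Vec3 normalView, Vec3 albedo,
                const PointLight& light, float specPower) {
    assert(light.range > 0.0f);

    Vec3 toLight = light.positionView - positionView;
    float dist = std::sqrt(dot(toLight, toLight));
    if (dist >= light.range) return {0.0f, 0.0f, 0.0f};  // out of reach

    Vec3 N = normalize(normalView);
    Vec3 L = normalize(toLight);
    float nDotL = std::max(dot(N, L), 0.0f);
    if (nDotL <= 0.0f) return {0.0f, 0.0f, 0.0f};        // facing away

    Vec3 V = normalize(positionView * -1.0f);            // towards the camera
    Vec3 R = N * (2.0f * dot(N, L)) - L;                 // L reflected about N
    float spec = std::pow(std::max(dot(R, V), 0.0f), specPower);

    float atten = 1.0f - dist / light.range;             // assumed linear falloff
    Vec3 lit = albedo * nDotL + Vec3{spec, spec, spec};
    return lit * light.color * atten;
}
```

In a shader, this function would be called once per light and the results summed for the pixel or sample.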

4 Lights using deferred rendering

The main use for the compute shader in this portion of the rendering pass is to allow as many lights as possible in the scene, by processing each light only in the sections of the image it can reach. A first basic check can be done easily on the CPU by only passing in lights that might affect the objects in the view frustum. This eliminates lights that have no effect on any scene object. For the tests, however, all lights were placed in such a way as to affect at least part of the scene.

In the compute shader, the idea of frustum culling was taken to an extreme. Each thread group was assigned a small square portion of the image and processed the lights just for that section. This tiled approach was presented by DICE at SIGGRAPH 2011.

One purple light, showing the tiles that are affected by it. The rest of the image is filled with the normal map to show unaffected tiles.

The process dispatches enough square thread groups to cover the entire image (using a ceiling function to ensure the right and bottom edges are properly handled). Inside a thread group, each thread retrieves the GBuffer data, including its z position in view space. Next, shared memory is used to gather the nearest and farthest z positions of the group. These form the near and far planes of the thread group's frustum. Then a projection matrix is created for the tile, based on half the size of the image divided by the size of the group and on the projection matrix of the scene camera. The thread group projection matrix also needs to be translated depending on the group's ID, so that the group matrix is shifted correctly based on the location of the tile within the picture.
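The tile setup above can be sketched on the CPU as follows. Note the swapped-in technique: rather than building the per-tile projection matrix explicitly, this sketch derives the four side planes directly from the tile's bounds in normalized device coordinates, which should describe the same frustum. It assumes a left-handed view space with the camera looking down +z and a standard perspective projection whose `_11` and `_22` terms are passed in as `proj11` and `proj22`. All type and function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Plane in view space: a*x + b*y + c*z + d >= 0 means "inside".
struct Plane { float a, b, c, d; };

struct TileFrustum { Plane planes[6]; };

// Number of thread groups needed along one axis, rounded up so that the
// right and bottom edges of the image are still covered.
unsigned tileCount(unsigned pixels, unsigned tileSize) {
    assert(tileSize > 0);
    return (pixels + tileSize - 1) / tileSize;
}

// Side planes pass through the camera origin, so d is zero.
static Plane normalized(float a, float b, float c) {
    float inv = 1.0f / std::sqrt(a * a + b * b + c * c);
    return {a * inv, b * inv, c * inv, 0.0f};
}

TileFrustum buildTileFrustum(unsigned tileX, unsigned tileY, unsigned tileSize,
                             unsigned width, unsigned height,
                             float proj11, float proj22,
                             float minZ, float maxZ) {
    // Tile edges in NDC. Pixel rows grow downward, so y is flipped.
    float left   = 2.0f * (tileX * tileSize) / width - 1.0f;
    float right  = 2.0f * ((tileX + 1) * tileSize) / width - 1.0f;
    float top    = 1.0f - 2.0f * (tileY * tileSize) / height;
    float bottom = 1.0f - 2.0f * ((tileY + 1) * tileSize) / height;

    TileFrustum f;
    // With x_ndc = proj11 * x / z and z > 0, "x_ndc >= left" becomes
    // proj11 * x - left * z >= 0, and similarly for the other sides.
    f.planes[0] = normalized( proj11, 0.0f, -left);    // left
    f.planes[1] = normalized(-proj11, 0.0f,  right);   // right
    f.planes[2] = normalized(0.0f,  proj22, -bottom);  // bottom
    f.planes[3] = normalized(0.0f, -proj22,  top);     // top
    // Near and far planes come from the tile's min and max view-space depth.
    f.planes[4] = Plane{0.0f, 0.0f,  1.0f, -minZ};
    f.planes[5] = Plane{0.0f, 0.0f, -1.0f,  maxZ};
    return f;
}

// Helper for checking the result: true if the point is inside all six planes.
bool insideFrustum(const TileFrustum& f, float x, float y, float z) {
    for (const Plane& p : f.planes) {
        if (p.a * x + p.b * y + p.c * z + p.d < 0.0f) return false;
    }
    return true;
}
```

For a 1280 by 720 target with 16 by 16 tiles, `tileCount` gives 80 by 45 groups. On the GPU, `minZ` and `maxZ` would come from the shared-memory reduction over the group's depth values.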

With a projection matrix created for the group, the frustum planes can be extracted and each light can be culled against them. For point lights, the position and attenuation range are taken into account and simple sphere-plane intersection tests are performed. Rather than have each thread check all of the lights against the frustum, the work can be distributed over all of the threads in the group, since they all share the same frustum. The list of lights is divided by the number of threads in the group, so each thread only has to test roughly numLights / (threadGroupWidth * threadGroupHeight) lights. If a light passes the frustum check, its ID is stored in group shared memory. A map approach was used to store the IDs, with keys generated by interlocked add calls so that the key changes every time a thread adds a light to the list. Once all of the lights have been processed, the rest of the shader can function similarly to the pixel shader approach, with the exception that each light must be looked up through the IDs stored in shared memory before it is processed. Any group that does not pass any lights is filled with black.
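The culling loop above can be emulated on the CPU as follows. This is a sketch under stated assumptions, not the shader itself: the threads of one group are run one after another, a `std::vector` stands in for the group shared ID list, and `std::atomic::fetch_add` stands in for the HLSL `InterlockedAdd` intrinsic. The `Plane` and `PointLight` layouts and the function names are hypothetical, and plane normals are assumed to be unit length and to point into the frustum.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Plane in view space: a*x + b*y + c*z + d >= 0 means "inside".
struct Plane { float a, b, c, d; };

// View-space centre plus the distance at which attenuation reaches zero.
struct PointLight { float x, y, z, radius; };

// A light is rejected only when its sphere lies entirely behind a plane.
// The test is conservative: spheres near a frustum corner may pass
// although they do not actually touch the tile.
bool sphereTouchesFrustum(const PointLight& l, const std::vector<Plane>& planes) {
    for (const Plane& p : planes) {
        float dist = p.a * l.x + p.b * l.y + p.c * l.z + p.d;
        if (dist < -l.radius) return false;
    }
    return true;
}

// Emulates one thread group: thread t tests lights t, t + groupSize, ...
std::vector<unsigned> cullLightsForTile(const std::vector<PointLight>& lights,
                                        const std::vector<Plane>& planes,
                                        unsigned groupSize) {
    assert(groupSize > 0);

    std::vector<unsigned> tileLightIds(lights.size());  // "groupshared" ID list
    std::atomic<unsigned> tileLightCount{0};            // "groupshared" counter

    for (unsigned thread = 0; thread < groupSize; ++thread) {
        for (std::size_t i = thread; i < lights.size(); i += groupSize) {
            if (sphereTouchesFrustum(lights[i], planes)) {
                // Reserve the next free slot; on the GPU this atomic add
                // keeps two threads from writing to the same slot.
                unsigned slot = tileLightCount.fetch_add(1);
                tileLightIds[slot] = static_cast<unsigned>(i);
            }
        }
    }

    tileLightIds.resize(tileLightCount.load());
    return tileLightIds;
}
```

The shading step then loops over the returned IDs and indexes into the full light list, instead of looping over every light. On the GPU the order of the IDs is not deterministic, and a group barrier is needed between filling the list and reading it.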

Below are some images showing how much of an effect this culling can have. Every time the number of lights changed, the maximum brightness of each light was adjusted to avoid a complete washout, as 4096 lights at full brightness simply produced a white screen.