A quick follow-up to the compute shader blur post. The main advantage of using the compute shader over the pixel shader for a per-pixel operation is the ability to store sampled data in shared memory for the thread group. While the texture cache of the pixel shader might be faster than storing the data in shared memory for small blocks, it cannot cache large areas of an input, which is why shared memory becomes the fastest option for convolutions with larger kernel radii (usually at least 5 or 7, as seen from the output). The drawback is the context switching between dispatch and draw calls, but since convolution is done as a post process, switching (and hence stalling) can be minimised.

Some of the main optimisations come from dealing with the borders of the thread groups. A basic (and this programmer's first) implementation would use N threads to output N - 2*KernelRadius pixels. The result is that each thread group overlaps its neighbours by 2*KernelRadius pixels both horizontally and vertically. This redundancy limits the possible gains of the shader program, and also keeps those threads busy so that they cannot be scheduled effectively.

An alternative is to have N threads output N pixels by having all threads put one sample into shared memory and having 2*KernelRadius threads store an extra pixel (from the extremes of the thread group extents). The result is a noticeable speed increase at the cost of 2*KernelRadius extra pixels in each thread group's shared memory. The next step is to test and optimise excessive memory grabs against the branch statement otherwise required to populate the entire shared memory before moving on to using the collected data.
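To make the N-threads-for-N-pixels scheme a little more concrete, here is a rough CPU-side sketch in Python that simulates a single row of thread groups. It is not the shader from the post: the box filter, the clamped edge reads, the choice of the first 2*KernelRadius threads for the extra loads, and every name in it (`blur_thread_group`, `blur_row`, `groups_needed`, and so on) are assumptions made purely for illustration.

```python
def clamp(i, lo, hi):
    return max(lo, min(i, hi))


def blur_reference(row, radius):
    """Box blur computed directly from the input, clamping reads at the edges."""
    n = len(row)
    width = 2 * radius + 1
    return [
        sum(row[clamp(x + k, 0, n - 1)] for k in range(-radius, radius + 1)) / width
        for x in range(n)
    ]


def blur_thread_group(row, group_start, group_size, radius):
    """Simulate one thread group: group_size threads output group_size pixels.

    Assumes group_size >= 2 * radius so the extra loads cover both aprons.
    """
    n = len(row)
    # "Shared memory": the tile plus 2 * radius extra samples.
    shared = [0.0] * (group_size + 2 * radius)

    # Load phase: every thread stores one sample, and the first 2 * radius
    # threads also store one extra sample from the far end of the tile.
    # This branch is the one the post weighs against excess memory grabs.
    for t in range(group_size):
        shared[t] = row[clamp(group_start - radius + t, 0, n - 1)]
        if t < 2 * radius:
            extra = group_size + t
            shared[extra] = row[clamp(group_start - radius + extra, 0, n - 1)]

    # On a GPU, a thread-group barrier would be needed here before reading.

    # Compute phase: thread t reads a window of 2 * radius + 1 shared samples.
    width = 2 * radius + 1
    return [sum(shared[t:t + width]) / width for t in range(group_size)]


def blur_row(row, group_size, radius):
    """Run enough thread groups to cover the row, then trim the padding."""
    out = []
    for group_start in range(0, len(row), group_size):
        out.extend(blur_thread_group(row, group_start, group_size, radius))
    return out[:len(row)]


def groups_needed(pixels, group_size, radius, overlapped):
    """Thread groups needed per row for the overlapped vs. non-overlapped scheme."""
    outputs_per_group = group_size - 2 * radius if overlapped else group_size
    return -(-pixels // outputs_per_group)  # ceiling division
```

As an illustrative figure only, for a 1920-pixel row with 64 threads per group and a kernel radius of 7, `groups_needed` gives 39 groups for the overlapped scheme and 30 for the N-outputs-N version, which is the redundancy the second approach removes.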