Unlocking GPU Intrinsics in HLSL

- GPU intrinsics in HLSL give shaders access to special instructions, such as warp shuffle, that exchange data between threads in a warp without going through shared memory.
- These intrinsics are encoded as special sequences of regular HLSL instructions that the driver can recognize and replace with the actual intrinsic operations.
- The sequences take the form of atomic operations on a UAV buffer. The buffer is a fake: the final shader never actually uses it.
- Even though the buffer is never used, the application must allocate a UAV slot for it and tell the driver which slot that is.
- These intrinsics can improve pixel shader performance in particular, because pixel shaders have no access to shared memory and few other ways to exchange data between threads.
- The intrinsics can be used in any pixel shader and are expanded into specific GPU instructions by the driver's JIT compiler.
- To create the runtime object for an extended shader in DirectX 11, call the NVAPI function NvAPI_D3D11_SetNvShaderExtnSlot() before ID3D11Device::CreatePixelShader() to select the UAV slot, then call it again afterwards to clear the setting.
- NVIDIA GPUs support more than just the warp shuffle intrinsics; the complete list is in the NVAPI header file nvHLSLExtns.h.
- In DirectX 12 and Vulkan, comparable warp-level operations are available as cross-vendor intrinsics (wave intrinsics in HLSL Shader Model 6.0, subgroup operations in Vulkan), so NVAPI is not required for them.
- When using these intrinsics, make sure the application does not specify the D3DCompiler_NO_XBOX_MS extension flag.
- More use cases and examples are available in the NVIDIA NVAPI SDK and the NVAPI GitHub repository.
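
To make the idea of exchanging data between threads in a warp more concrete, here is a minimal CPU-side sketch in C++ that only simulates the data movement. It assumes a 32-lane warp and models a shuffle-xor style exchange, in which each lane reads the register value of the lane whose index differs by a bit mask. The names `Warp`, `shuffle_xor`, and `warp_sum` are illustrative and are not part of NVAPI or HLSL; on the GPU the exchange is a single instruction rather than a loop.

```cpp
#include <array>
#include <cassert>

// Assumption: 32 lanes per warp, modelled here as a plain array of registers.
constexpr int kWarpSize = 32;
using Warp = std::array<float, kWarpSize>;

// Simulated shuffle-xor: every lane receives the value held by lane (lane ^ mask).
// No shared memory is involved; values move directly between per-lane registers.
Warp shuffle_xor(const Warp& regs, int mask) {
    assert(mask >= 0 && mask < kWarpSize);  // keeps (lane ^ mask) inside the warp
    Warp out{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        out[lane] = regs[lane ^ mask];
    }
    return out;
}

// Butterfly reduction built on the exchange above: after log2(32) = 5 steps,
// every lane holds the sum of all 32 input values.
Warp warp_sum(Warp regs) {
    for (int mask = kWarpSize / 2; mask > 0; mask /= 2) {
        const Warp other = shuffle_xor(regs, mask);
        for (int lane = 0; lane < kWarpSize; ++lane) {
            regs[lane] += other[lane];
        }
    }
    return regs;
}
```

If lane i starts with the value i, calling `warp_sum` leaves 496 (the sum 0 + 1 + ... + 31) in every lane, which is the kind of result a shader would otherwise need shared memory to compute.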
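
The DirectX 11 ordering described above (select the UAV slot, create the shader, clear the slot again) can be sketched as a small scope guard. This is only an illustration of the call order: `FakeDevice`, `set_extension_slot`, and `create_pixel_shader` are hypothetical stand-ins for the device, the NVAPI slot call, and ID3D11Device::CreatePixelShader(), and the `~0u` "no slot" value is an assumption. None of the code below calls real NVAPI or Direct3D functions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Assumed sentinel meaning "no extension slot selected".
constexpr std::uint32_t kNoSlot = ~0u;

// Hypothetical stand-in for the device and driver state.
struct FakeDevice {
    std::uint32_t extension_slot = kNoSlot;
    std::vector<std::string> log;  // records the order of calls
};

// Stand-in for the NVAPI call that tells the driver which UAV slot is the fake one.
void set_extension_slot(FakeDevice& dev, std::uint32_t slot) {
    dev.extension_slot = slot;
    dev.log.push_back(slot == kNoSlot ? "clear" : "set");
}

// Stand-in for ID3D11Device::CreatePixelShader(); reports whether a slot
// was selected at the moment the shader object was created.
bool create_pixel_shader(FakeDevice& dev) {
    dev.log.push_back("create");
    return dev.extension_slot != kNoSlot;
}

// Scope guard: selects the slot on construction and clears it on destruction,
// so the clear step cannot be forgotten on an early return.
class ExtensionSlotScope {
public:
    ExtensionSlotScope(FakeDevice& dev, std::uint32_t slot) : dev_(dev) {
        assert(slot != kNoSlot);  // a real slot index is expected here
        set_extension_slot(dev_, slot);
    }
    ~ExtensionSlotScope() { set_extension_slot(dev_, kNoSlot); }
    ExtensionSlotScope(const ExtensionSlotScope&) = delete;
    ExtensionSlotScope& operator=(const ExtensionSlotScope&) = delete;

private:
    FakeDevice& dev_;
};

// Creates an "extended" shader with the slot selected only for the duration
// of the creation call.
bool create_extended_pixel_shader(FakeDevice& dev, std::uint32_t slot) {
    ExtensionSlotScope scope(dev, slot);
    return create_pixel_shader(dev);
}
```

Wrapping the two slot calls in a guard is a design choice rather than a requirement. Its benefit is that shaders created later, which do not use the intrinsics, never see a leftover slot setting.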