Unlocking GPU Intrinsics in HLSL

GPU intrinsics in HLSL let shaders use special instructions, such as warp shuffles, to exchange data between threads in a warp without going through shared memory.
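To make the idea of exchanging data between the threads of a warp concrete, here is a minimal CPU-side sketch in C++ that emulates a XOR-style shuffle and uses it for a butterfly sum. It is an illustration only, not NVAPI or HLSL code: the names kWarpSize, WarpValues, shuffleXor and warpSum are hypothetical, and the warp size of 32 is an assumption that matches current NVIDIA GPUs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 32 lanes (threads) per warp.
constexpr std::size_t kWarpSize = 32;

// One register value per lane of a single warp.
using WarpValues = std::array<std::int32_t, kWarpSize>;

// Emulated shuffle: lane i receives the value currently held by lane (i XOR laneMask).
// On a GPU this is one instruction per lane, with no shared memory involved.
WarpValues shuffleXor(const WarpValues& in, std::size_t laneMask) {
    assert(laneMask < kWarpSize);  // keeps (lane ^ laneMask) inside the warp
    WarpValues out{};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        out[lane] = in[lane ^ laneMask];
    }
    return out;
}

// Butterfly reduction: after log2(32) = 5 exchange steps every lane holds the full sum.
WarpValues warpSum(WarpValues values) {
    for (std::size_t mask = kWarpSize / 2; mask > 0; mask /= 2) {
        const WarpValues other = shuffleXor(values, mask);
        for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
            values[lane] += other[lane];
        }
    }
    return values;
}
```

If lane i starts with the value i, every lane ends up holding 496 (the sum of 0 through 31) after five exchange steps.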
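These intrinsics are encoded as special sequences of regular HLSL instructions that the driver can recognize and turn into the intended operations.
The sequences use atomic operations on a UAV buffer, which the HLSL compiler cannot optimize away because it cannot rule out dependencies on them.
The UAV buffer itself is a fake: the shader does not actually use it once the driver has processed the shader.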
To use these intrinsics, the application must allocate a UAV slot and tell the driver which slot to use.
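The role of the reserved UAV slot can be pictured with a toy model. The sketch below is not how the NVIDIA driver is implemented and does not mirror real shader bytecode: the Op, Instr and expandIntrinsics names are hypothetical, and each intrinsic is collapsed to a single marker instruction for brevity. It only shows why the driver needs to know the slot: atomics on the reserved slot are treated as encoded intrinsics, while atomics on any other UAV are real work and must be left alone.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy instruction kinds (hypothetical, for illustration only).
enum class Op { Add, UavAtomic, NativeIntrinsic };

struct Instr {
    Op op;
    std::uint32_t uavSlot;  // meaningful only when op == Op::UavAtomic
};

// Toy "driver pass": rewrites atomics that target the reserved fake UAV slot
// into a native intrinsic marker and copies every other instruction unchanged.
std::vector<Instr> expandIntrinsics(const std::vector<Instr>& shader,
                                    std::uint32_t reservedSlot) {
    std::vector<Instr> out;
    out.reserve(shader.size());
    for (const Instr& instr : shader) {
        if (instr.op == Op::UavAtomic && instr.uavSlot == reservedSlot) {
            out.push_back(Instr{Op::NativeIntrinsic, 0u});
        } else {
            out.push_back(instr);  // real UAV work or ordinary arithmetic
        }
    }
    return out;
}
```

In this model, a shader that performs atomics on slots 7 and 2 with slot 7 reserved keeps its atomic on slot 2, and only the slot 7 access is replaced.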
Using these intrinsics can improve the performance of pixel shaders, because threads in a warp exchange values directly instead of going through memory.
The intrinsics can be used in any pixel shader and are expanded into specific GPU instructions by the driver's JIT compiler.
To create a runtime object for an extended shader in DirectX 11, the application must tell the driver the extension UAV slot with the NVAPI function NvAPI_D3D11_SetNvShaderExtnSlot() before calling ID3D11Device::CreatePixelShader(), and reset the slot after the call.
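The set-before, reset-after pattern around shader creation can be wrapped in a small scope guard. The sketch below is a hypothetical helper, not part of NVAPI: the slot-setting call is injected as a callable so that the example compiles without the NVAPI SDK or the Direct3D headers. The use of ~0u as the "disabled" slot value, and the NVAPI function named in the comment, are assumptions to check against nvapi.h.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical RAII helper: enables the shader extension slot on construction
// and disables it again on destruction.
class ShaderExtensionScope {
public:
    using SetSlotFn = std::function<void(std::uint32_t)>;

    // Assumption: passing ~0u as the slot tells the driver to disable the extension.
    static constexpr std::uint32_t kDisabled = ~0u;

    ShaderExtensionScope(SetSlotFn setSlot, std::uint32_t uavSlot)
        : setSlot_(std::move(setSlot)) {
        assert(setSlot_ != nullptr);
        setSlot_(uavSlot);  // must match the UAV slot declared in the HLSL source
    }

    ~ShaderExtensionScope() { setSlot_(kDisabled); }

    ShaderExtensionScope(const ShaderExtensionScope&) = delete;
    ShaderExtensionScope& operator=(const ShaderExtensionScope&) = delete;

private:
    SetSlotFn setSlot_;
};

// In a D3D11 application the callable would wrap the NVAPI call, for example
// (assumed API, verify against nvapi.h):
//   ShaderExtensionScope scope(
//       [&](std::uint32_t slot) { NvAPI_D3D11_SetNvShaderExtnSlot(device, slot); }, 7);
//   device->CreatePixelShader(bytecode, bytecodeSize, nullptr, &pixelShader);
```

Tying the reset to the destructor means the slot is also restored if shader creation fails partway through or returns early.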
NVIDIA GPUs support more than just warp shuffle intrinsics; the complete list is in the NVAPI header file nvHLSLExtns.h.
In DirectX 12 and Vulkan, comparable functionality is available through cross-vendor intrinsics (Shader Model 6.0 wave intrinsics and Vulkan subgroup operations), so NVAPI is not required there.
Make sure the application does not specify the D3DCompiler_NO_XBOX_MS extension flag when using these intrinsics.
More use cases and examples of these intrinsics are available in the NVIDIA NVAPI SDK and the NVAPI GitHub repository.