Working with Multiple Accelerators in C++ AMP
- 9/15/2012
Falling Back to the CPU
If no C++ AMP-capable GPUs are available, your application can fall back to a parallel implementation on the CPU by using either the PPL or the C++ AMP WARP accelerator. The section “Enumerating Accelerators” covers how to enumerate the available accelerators and choose the best one for your application. By default, if no C++ AMP-capable GPU accelerators are present, the C++ AMP runtime falls back to the WARP accelerator when it is available (on Windows 8).
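The fallback order described above can be sketched as a small selection function. This is an illustrative model only, written in portable C++ so that it compiles without amp.h: AcceleratorInfo, ComputePath, and choose_compute_path are hypothetical names, and the struct merely mirrors the device_path and is_emulated properties of concurrency::accelerator. It assumes that WARP reports the device path "direct3d\warp" (the value of accelerator::direct3d_warp) and that every software accelerator reports is_emulated as true. In a real application the list would come from accelerator::get_all().

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the two concurrency::accelerator properties used
// here; real code would read them from the objects in accelerator::get_all().
struct AcceleratorInfo {
    std::wstring device_path;  // assumed to be L"direct3d\\warp" for WARP
    bool is_emulated;          // assumed true for WARP, REF, and the CPU accelerator
};

// The three strategies discussed in the text.
enum class ComputePath { HardwareGpu, Warp, CpuFallback };

// Assumption: this matches the value of accelerator::direct3d_warp.
const std::wstring kWarpDevicePath = L"direct3d\\warp";

// Prefer a hardware GPU, then WARP, then a CPU implementation (PPL, for example).
ComputePath choose_compute_path(const std::vector<AcceleratorInfo>& accelerators) {
    bool warp_available = false;
    for (const AcceleratorInfo& a : accelerators) {
        if (!a.is_emulated) {
            return ComputePath::HardwareGpu;  // a real device wins immediately
        }
        if (a.device_path == kWarpDevicePath) {
            warp_available = true;            // remember WARP, keep looking for a GPU
        }
    }
    return warp_available ? ComputePath::Warp : ComputePath::CpuFallback;
}
```

Returning an enumeration rather than an accelerator keeps the decision separate from the code that acts on it, so the CpuFallback case can dispatch to a PPL implementation that never touches C++ AMP.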
Using a WARP accelerator lets your application run the same code on the CPU that it runs on the GPU, so there is less code to maintain. The WARP accelerator takes advantage of multiple cores and SIMD instructions and can deliver performance comparable to, or even better than, PPL code running on the CPU. This is particularly true if your algorithm would have been implemented on the CPU in a data-parallel way. Coding your algorithm in C++ AMP makes it simpler for the compiler to make good use of all the CPU cores and to vectorize your code.
In some cases, a different algorithm and different data structures on the CPU can outperform C++ AMP code running on WARP. This is especially true if there is a very efficient task-parallel approach that maps better to a multicore CPU than the data-parallel C++ AMP code. The case studies included in this book illustrate these tradeoffs.
The NBody case study (see Chapter 2) does not use WARP; if no suitable GPU is available, it falls back to a custom implementation written for the CPU, the advanced CPU integrator. The advanced CPU integrator halves the number of force calculations by exploiting the fact that the force particle A exerts on particle B is the exact opposite of the force particle B exerts on particle A. It also breaks down the calculation in such a way as to maximize cache coherence, which improves core utilization as the application becomes memory-bound. Explicitly coded SSE vectorization, written with intrinsic functions, improves its performance further. In contrast, the NBody sample’s C++ AMP integrators rely on the massive data parallelism of the GPU and directly calculate both forces for each particle pair. Incurring the additional cost of these calculations is more efficient on the GPU than implementing an integrator that tries to take advantage of the pair calculations with a much more complex kernel.
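The pair symmetry that halves the force calculations can be sketched in plain C++. This is a minimal scalar illustration, not the NBody sample's advanced CPU integrator: it omits the cache blocking and SSE intrinsics described above, and Vec3, accumulate_forces_symmetric, the softening term, and the choice of a gravitational constant of 1 are all assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };  // hypothetical helper type

// Visits each unordered pair (i, j) once and applies the force to both
// particles with opposite signs, so n particles need n*(n-1)/2 force
// evaluations instead of n*(n-1). Returns the number of evaluations.
std::size_t accumulate_forces_symmetric(const std::vector<Vec3>& pos,
                                        const std::vector<float>& mass,
                                        std::vector<Vec3>& force) {
    assert(pos.size() == mass.size());
    const float softening_sq = 1e-9f;  // assumed value; avoids division by zero
    const std::size_t n = pos.size();
    force.assign(n, Vec3{0.0f, 0.0f, 0.0f});
    std::size_t evaluations = 0;

    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            const float dx = pos[j].x - pos[i].x;
            const float dy = pos[j].y - pos[i].y;
            const float dz = pos[j].z - pos[i].z;
            const float r2 = dx * dx + dy * dy + dz * dz + softening_sq;
            const float inv_r = 1.0f / std::sqrt(r2);
            // Magnitude over distance, with the gravitational constant taken as 1.
            const float s = mass[i] * mass[j] * inv_r * inv_r * inv_r;

            // Force on i points toward j; the force on j is its exact opposite.
            force[i].x += s * dx;  force[i].y += s * dy;  force[i].z += s * dz;
            force[j].x -= s * dx;  force[j].y -= s * dy;  force[j].z -= s * dz;
            ++evaluations;
        }
    }
    return evaluations;
}
```

The inner loop writes to both force[i] and force[j], so different values of i touch the same output elements. That shared write is what makes this formulation awkward to express as a simple data-parallel kernel, which is consistent with the GPU integrators computing both forces of each pair independently.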
The Reduction case study has no code to detect or choose accelerators; on every run it compares both sequential and parallel CPU implementations with the C++ AMP implementations. The copy time, whether to a GPU accelerator or to a WARP accelerator, outweighs the execution time, but the Reduction code might be appropriate for C++ AMP if it were part of a larger calculation that could justify the copy time. On a variety of hardware, the execution time on WARP was never less than the CPU execution time, but the most optimized WARP time was not significantly more than the CPU time. The effort saved by not maintaining separate CPU and accelerator versions of the same algorithms could be significant. In that case, getting roughly the same performance on WARP without writing a CPU version would be a good solution, producing an application that runs on a variety of hardware without being written twice.
In Chapter 10, the Cartoonizer case study shows an example in which WARP delivers better performance than the CPU implementation. In this case, the CPU code uses the same data-parallel algorithm as the C++ AMP code and relies on the C++ compiler’s autovectorization features to take advantage of SIMD. The C++ AMP implementation using WARP runs faster than the CPU implementation because it makes better use of all the cores and their vector processing units.
Few developers can afford to declare that their application won’t run on hardware that doesn’t include a DirectX 11 accelerator. Whether you support configurations without a hardware accelerator by using WARP or by creating a CPU-based implementation with PPL (and possibly SSE) largely depends on the nature of your application. WARP might well be the best choice if your algorithm is data parallel and does not use double precision, or if it’s not possible to take advantage of task parallelism on the CPU to write a more efficient implementation.