“Programming Massively Parallel Processors (second edition)” by Kirk and Hwu is a very good second book for those interested in getting started with CUDA. A first must-read is “CUDA by Example: An Introduction to General-Purpose GPU Programming” by Jason Sanders. After reading all of Sanders’s work, feel free to jump right to chapters 8 and 9 of this Kirk and Hwu publication.
In chapter 8, the authors do a nice job of explaining how to write an efficient convolution algorithm that is useful for smoothing and sharpening data sets. Their explanation of how shared memory can play a key role in improving performance is well written. They also handle the issue of “halo” data very well. Benchmark data would have served as a nice conclusion to this chapter.
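For readers who have not met the “halo” idea yet, here is a minimal sketch of it in plain Python rather than CUDA. It is my own illustration, not the authors' kernel: the function names, the tile size, and the choice to treat out-of-range elements as zero are all assumptions. Each tile copies its own elements plus `radius` neighbors on either side into a local list that stands in for shared memory, and then every output in the tile reads only from that local copy.

```python
def conv1d_direct(data, mask):
    """Reference 1D convolution; out-of-range neighbors count as zero."""
    radius = len(mask) // 2
    out = []
    for i in range(len(data)):
        acc = 0.0
        for j, weight in enumerate(mask):
            k = i + j - radius
            if 0 <= k < len(data):
                acc += data[k] * weight
        out.append(acc)
    return out


def conv1d_tiled(data, mask, tile):
    """Tiled 1D convolution; 'shared' stands in for a block's shared memory."""
    radius = len(mask) // 2
    n = len(data)
    out = [0.0] * n
    for start in range(0, n, tile):
        # Load the tile plus a halo of 'radius' elements on each side.
        # Halo cells that fall outside the array are filled with zeros.
        shared = [
            data[k] if 0 <= k < n else 0.0
            for k in range(start - radius, start + tile + radius)
        ]
        # Each "thread" t computes one output using only the local copy.
        for t in range(min(tile, n - start)):
            out[start + t] = sum(
                shared[t + j] * mask[j] for j in range(len(mask))
            )
    return out
```

Calling `conv1d_tiled(data, [0.25, 0.5, 0.25], 4)` smooths `data` with a three-point mask and should agree with `conv1d_direct`. On a real GPU the payoff is that each input element is fetched from global memory roughly once per tile instead of once per mask position.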
In chapter 9, the authors provide the best description of the Prefix Sum algorithm I have seen to date. They describe the problem being solved in terms that I can easily relate to: food. They write, “We can illustrate the applications of inclusive scan operations using an example of cutting sausage for a group of people.” They first describe a simple algorithm, then a “work-efficient” algorithm, and then an extension for larger data sets. What puzzles me here is that the authors seem fixated on solving the problem with the least number of total operations (across all threads) as opposed to the least number of operations per thread. They do not mention that the “work-efficient” algorithm requires almost twice as many operations for the longest-path thread as the simple algorithm. A skeptical reader would need actual performance benchmarks showing a net throughput gain.
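To make the trade-off concrete, here is a rough sketch that counts both quantities. It assumes the “simple” algorithm is the usual stride-doubling (Kogge-Stone style) scan and the “work-efficient” one is the reduction-then-distribution (Brent-Kung style) scan, and it assumes the input length is a power of two. The function names are mine, and this is sequential Python that mimics the parallel steps, not the book's kernels. A “step” here is one synchronized iteration, which is what the longest-path thread has to sit through.

```python
def simple_scan(values):
    """Stride-doubling inclusive scan. Returns (result, steps, additions)."""
    x = list(values)
    n = len(x)
    steps = additions = 0
    stride = 1
    while stride < n:
        prev = x[:]  # snapshot, as if all threads read before any writes
        for i in range(stride, n):
            x[i] = prev[i] + prev[i - stride]
            additions += 1
        steps += 1
        stride *= 2
    return x, steps, additions


def work_efficient_scan(values):
    """Reduction + distribution inclusive scan (length a power of two)."""
    x = list(values)
    n = len(x)
    steps = additions = 0
    # Reduction phase: build partial sums up a binary tree.
    stride = 1
    while stride < n:
        for i in range(2 * stride - 1, n, 2 * stride):
            x[i] += x[i - stride]
            additions += 1
        steps += 1
        stride *= 2
    # Distribution phase: push partial sums back down the tree.
    stride = n // 4
    while stride >= 1:
        for i in range(3 * stride - 1, n, 2 * stride):
            x[i] += x[i - stride]
            additions += 1
        steps += 1
        stride //= 2
    return x, steps, additions
```

For 1,024 elements this sketch reports 10 steps and 9,217 additions for the simple scan, against 19 steps and 2,036 additions for the work-efficient one. So total work drops by more than a factor of four while the number of synchronized steps nearly doubles. Which effect wins on real hardware is exactly the question a benchmark would answer.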
Now before moving forward, let's back up a bit. Even though we have already read CUDA by Example, it is worth reading chapter 6… at least the portion regarding the reduction algorithm starting at the top of page 128. The discussion is rather well written and insightful. Now, onward.
In chapter 13, the authors list the three-fold goals of parallel computing: solve a given problem in less time, solve bigger problems in the same amount of time, and achieve better solutions for a given problem in a given amount of time. These all make sense, but they have not been the reasons I have witnessed for the transition to parallel computing. I believe the biggest motivation for utilizing CUDA is to solve problems that would otherwise be unsolvable. For example, the rate of data generated by many scientific instruments could simply not be processed without a massively parallel computing solution. In other words, CUDA makes things possible.
Also in chapter 13, they bring up a very important point. Solving problems with thousands of threads requires that software developers think differently. To think of the resources of a GPU as a means by which you can make a parallel-for-loop run faster completely misses the point – and the opportunity the GPU provides. These three chapters (8, 9, and 13) then make the book worthwhile.
The chapters on OpenCL, OpenACC, and AMP seem a bit misplaced in a book like this. The authors’ coverage of these topics is a bit too superficial to make them useful for serious developers. On page 402, they list the various data types that AMP supports. It would have made sense for the authors to point out that AMP does not support byte and short. When processing large data sets of these types, AMP introduces serious performance penalties.
This then brings me to my biggest concern about this book. There is very little attention paid to the technique of overlapping data transfer operations with kernel execution. I did happen upon a discussion of streaming in chapter 19, “Programming a Heterogeneous Computing Cluster.” However, the context of the material is with respect to MPI, and those not interested in MPI might easily miss it. Because overlapping I/O with kernel operations can easily double throughput, I believe this topic deserves at least one full chapter. Perhaps in the next edition, we can insert it between chapters 8 and 9? Oh, and let’s add “Overlapped I/O”, “Concurrent” and “Streams” as first-class citizens in the index. While we are editing the index, let’s just drop the entry for “Apple’s iPhone Interfaces”. Seriously.
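A back-of-the-envelope model shows where the “double throughput” figure can come from. The sketch below is an idealized timing model, not measured data and not CUDA code. It assumes the work is split into equal chunks, that each chunk needs a host-to-device copy, a kernel, and a device-to-host copy with fixed durations, and that the device can run one of each at the same time (as with separate streams and separate copy engines for each direction). The function names and the sample durations are made up for illustration.

```python
def serial_time(chunks, h2d, kernel, d2h):
    """Total time when every copy and kernel runs one after another."""
    return chunks * (h2d + kernel + d2h)


def overlapped_time(chunks, h2d, kernel, d2h):
    """Total time for an ideal three-stage pipeline with fixed stage times.

    The first chunk pays for all three stages; after that, one chunk
    finishes every 'max(stage)' time units.
    """
    stages = (h2d, kernel, d2h)
    return sum(stages) + (chunks - 1) * max(stages)


def speedup(chunks, h2d, kernel, d2h):
    """Ratio of serial time to overlapped time under this model."""
    return serial_time(chunks, h2d, kernel, d2h) / overlapped_time(
        chunks, h2d, kernel, d2h
    )
```

With 100 chunks and hypothetical durations of 1, 2, and 1 time units for copy-in, kernel, and copy-out, the model gives 400 units serially and 202 units overlapped, a speedup just under 2. Real hardware adds launch overheads and contention, so treat this only as the upper bound that makes the technique worth a chapter.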
In summary, I believe this is a very helpful book and well written. I would consider it a good resource for CUDA developers. It is not, however, a must-have CUDA resource.