Graphics processing units provide an astonishing number of floating-point operations per second and deliver memory bandwidths one order of magnitude greater than those of common general-purpose central processing units.
With the introduction of the Compute Unified Device Architecture, NVIDIA took a first step towards easing access to the vast computational resources of graphics processing units. The aim of this thesis is to shed light on the general hardware and software structures of this promising architecture. In contrast to well-established high performance architectures, which offer moderate on-chip parallelism, graphics processing units use massive parallelism at the thread level. Thus, parallelization approaches are required which exploit a substantially finer level of parallelism than OpenMP parallelization on standard multi-core and multi-socket servers.
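To make this contrast concrete, the following minimal SAXPY kernel (a standard introductory CUDA example, not taken from this thesis) assigns one GPU thread per vector element, whereas an OpenMP version would distribute coarse loop chunks across a handful of cores:

```cuda
#include <cstdio>
#include <cstdlib>

// One thread per vector element: fine-grained thread-level parallelism.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the last block
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;                         // device copies of the vectors
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    // 256 threads per block; the grid covers all n elements.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);

    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);           // 2*1 + 2 = 4

    cudaFree(dx); cudaFree(dy);
    free(hx); free(hy);
    return 0;
}
```

Where an OpenMP code would spawn on the order of tens of threads, this launch creates over a million lightweight threads, which the hardware schedules to hide memory latency.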
Basic benchmark kernels as well as libraries are investigated to demonstrate the fundamental parallelization approaches and the potential regarding peak performance and main memory bandwidth. A kernel from a computational fluid dynamics solver based on the lattice Boltzmann method is introduced and evaluated in terms of implementation issues and performance. Even for this simple computational fluid dynamics kernel, substantial work has to be invested in low-level hand optimization to exploit the full capabilities of graphics processing units. For selected verification cases, the optimized kernel outperforms a standard two-socket server at single-precision accuracy by almost one order of magnitude.
Some of the material presented here may fall under Copyright © 2008 NVIDIA Corporation.