An optimized D2Q37 Lattice Boltzmann code on GP-GPUs

We describe the implementation of a thermal compressible Lattice Boltzmann algorithm on an NVIDIA Tesla C2050 system based on the Fermi GP-GPU. We consider two versions of the code, with and without reactive effects. We describe the overall organization of the algorithm and give details of its implementation. Efficiency ranges from 25% to 31% of the double-precision peak performance of the GP-GPU. We compare our results with a different implementation of the same algorithm, developed and optimized for many-core Intel Westmere CPUs. © 2012 Elsevier Ltd. All rights reserved.
Biferale, Luca and Mantovani, Filippo and Pivanti, Marcello and Pozzati, Fabio and Sbragaglia, Mauro and Scagliarini, Andrea and Schifano, Sebastiano Fabio and Toschi, Federico and Tripiccione, Raffaele
Pergamon Press.
Computers & Fluids