Graphics processing units (GPUs) are currently used as a cost-effective platform for computer simulations and big-data processing. Large-scale applications require that multiple GPUs work together, but the efficiency obtained with clusters of GPUs is, at times, sub-optimal because the GPU features are not fully exploited. We describe how to achieve excellent efficiency for applications in statistical mechanics, particle dynamics, and network analysis by using suitable memory access patterns and mechanisms such as CUDA streams and profiling tools.
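
The following is a minimal illustrative sketch, not taken from the paper, of the kind of mechanism mentioned above: two CUDA streams are used so that host-device transfers of one data chunk can overlap with kernel execution on another, and the kernel reads memory in a coalesced pattern (thread i touches element i). Names such as the scale kernel are hypothetical.

// Sketch: overlapping asynchronous copies and kernel execution with CUDA streams.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                    // coalesced access: consecutive threads read consecutive elements
}

int main() {
    const int N = 1 << 20, CHUNK = N / 2;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));   // pinned host memory, needed for truly asynchronous copies
    cudaMalloc(&d, N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    cudaStream_t s[2];
    for (int k = 0; k < 2; ++k) cudaStreamCreate(&s[k]);

    // Each stream handles one chunk: the copy issued in one stream can overlap
    // with the kernel running in the other stream.
    for (int k = 0; k < 2; ++k) {
        size_t off = (size_t)k * CHUNK;
        cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        scale<<<(CHUNK + 255) / 256, 256, 0, s[k]>>>(d + off, CHUNK, 2.0f);
        cudaMemcpyAsync(h + off, d + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();
    printf("h[0] = %f\n", h[0]);             // expect 2.0

    for (int k = 0; k < 2; ++k) cudaStreamDestroy(s[k]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}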