5a32a9b8a5
* Fix data race in CUDA's "cpy" kernel.
* Remove extra barrier by using more shared memory.