A response to the blog post "{n} times faster than C". Our final program achieved a speedup of 128x (36 GiB/s throughput) by reformulating the problem and leveraging SIMD intrinsics.
This post seems to have taken the title from a previous post that this is built upon. So that is probably why the title gets a bit confusing when viewed standalone.
This post seems to have taken the title from a previous post that this is built upon. So that is probably why the title gets a bit confusing when viewed standalone.