The current implementation is blazing fast: roughly 5-9x faster than the original release and roughly 2-4x faster than this concurrent PyTorch implementation. What's the secret sauce behind this speedup? Multiple ...