@RichardRegal I just tested again. At the time I made these optimizations, there was a clearly measurable speedup for tiles per thread > 1. My new tests (with gcc 10.x) show little speedup, if any, probably because I’m testing with a native build, where gcc inserts good prefetch instructions for my machine. I will now test with clang builds and with generic x86_64 builds to decide whether we keep this stuff or not. It will take a day or two, though.