PhotoFlow optimizations and benchmarks

I think using interleaved RGBRGBRGBRGB data as both input and output is not optimal for vectorizing. To get a good speedup, the input should be planar (RRRRGGGGBBBB). To get an even better speedup, the output should use the same order. Then you can just load four floats of one channel with a single SIMD load, process them, and store them.
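As a rough illustration of the "load four, process, store" idea on planar data, here is a minimal sketch using SSE intrinsics. It assumes an x86/x86-64 target, and both the function name `scale_plane_sse` and the gain operation are made up for this example; they are not taken from PhotoFlow.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86/x86-64 target */

/* Hypothetical per-channel operation: multiply every sample of one
 * plane (e.g. all R values) by a constant gain. */
void scale_plane_sse(const float *in, float *out, size_t n, float gain)
{
    const __m128 g = _mm_set1_ps(gain);   /* broadcast gain to all four lanes */
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(in + i);  /* load four floats of one channel */
        v = _mm_mul_ps(v, g);             /* process all four at once */
        _mm_storeu_ps(out + i, v);        /* store the four results */
    }
    for (; i < n; ++i)                    /* scalar tail for leftover samples */
        out[i] = in[i] * gain;
}
```

With planar data you would call this once per channel, without any shuffling in between.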

Since this is not the case in your example, you need to shuffle the input values into that order, let SIMD do its magic, and then shuffle the results back for the output (to get RGBRGBRGBRGB again).
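The two shuffle steps could look roughly like the sketch below. It is written as plain scalar C to keep the data movement readable; a real SIMD version would use shuffle instructions instead. The names `rgb_to_planar` and `planar_to_rgb` are hypothetical helpers for this example only.

```c
#include <stddef.h>

/* Split interleaved RGBRGB... into three separate planes. */
void rgb_to_planar(const float *rgb, float *r, float *g, float *b,
                   size_t npixels)
{
    for (size_t i = 0; i < npixels; ++i) {
        r[i] = rgb[3 * i + 0];
        g[i] = rgb[3 * i + 1];
        b[i] = rgb[3 * i + 2];
    }
}

/* Merge three planes back into interleaved RGBRGB... */
void planar_to_rgb(const float *r, const float *g, const float *b,
                   float *rgb, size_t npixels)
{
    for (size_t i = 0; i < npixels; ++i) {
        rgb[3 * i + 0] = r[i];
        rgb[3 * i + 1] = g[i];
        rgb[3 * i + 2] = b[i];
    }
}
```

The vectorized processing would run on `r`, `g` and `b` between the two calls. Both conversions cost extra memory traffic, which is why keeping the data planar from the start is preferable.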

Edit: Maybe there is a better way, but I really don’t want to think about it. Just arrange the input and output for SIMD. Then there is a good chance you won’t even need to write SIMD code, because the compiler’s vectorizer will already do that for you.
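To show what "letting the compiler do it" might look like, here is a sketch of a plain loop over planar data with no intrinsics at all. The luminance calculation (Rec. 709 weights) and the name `luminance_planar` are just an assumed example operation. Whether the loop actually gets vectorized depends on the compiler and its optimization flags, so it is worth checking the vectorization report or the generated assembly.

```c
#include <stddef.h>

/* Plain C loop over planar channels. `restrict` promises the compiler
 * that the arrays do not overlap, which makes auto-vectorization easier. */
void luminance_planar(const float *restrict r, const float *restrict g,
                      const float *restrict b, float *restrict y, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* Every iteration is independent and reads contiguous memory,
         * so several pixels can be computed per SIMD instruction. */
        y[i] = 0.2126f * r[i] + 0.7152f * g[i] + 0.0722f * b[i];
    }
}
```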
