I am looking for expert advice on a C/C++ loop optimization technique that, in theory, saves a little machine time. In practice, I don’t see much of a difference, but I admit that I haven’t run many comparison tests yet.
This happens in the fill(const char *const expression) function, when evaluating a math expression that returns a vector-valued result. In that case, for each (x,y,z) pixel of the image, the vector must be copied into the image. I use a loop that looks like this:
const double *ptrs = res._data;
T *_ptrd = ptrd--;
for (unsigned int n = N; n>0; --n) { *_ptrd = (T)(*ptrs++); _ptrd+=whd; }
(The vector values returned by the math evaluator are contiguous in memory, but the values copied into the image are not: they are separated by an offset of whd.)
I’ve tried replacing the loop above with:
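To make the layout concrete, here is a minimal sketch, assuming a planar buffer in which consecutive components of a pixel's vector sit whd elements apart. The names `scatter_vector`, `image` and `offset` are hypothetical and used only for illustration; they are not part of CImg.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a CImg function): copy a contiguous vector of
// values into a planar image buffer, writing component n at offset + n*whd.
template<typename T>
void scatter_vector(std::vector<T>& image, std::size_t offset, std::size_t whd,
                    const std::vector<double>& values) {
  const std::size_t N = values.size();
  // The last component written must still lie inside the buffer.
  assert(N == 0 || offset + (N - 1)*whd < image.size());
  for (std::size_t n = 0; n < N; ++n)
    image[offset + n*whd] = static_cast<T>(values[n]);
}
```

With whd = 4 and a 3-component vector written at offset 1, the values end up at indices 1, 5 and 9 of the buffer: the reads are contiguous, the writes are strided.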
const double *const cimg_restrict ptrs = res._data;
T *cimg_restrict _ptrd = ptrd--;
cimg_pragma_openmp(simd)
for (unsigned int n = 0; n<N; ++n) _ptrd[n*whd] = (T)ptrs[n];
Is there anyone here who is familiar with SIMD loop vectorization and can tell me whether the second version has any chance of being faster than the original?
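For anyone who wants to compare the two variants outside of CImg, here is a rough sketch of a standalone test bed. It assumes that `cimg_restrict` behaves like the `__restrict` extension accepted by GCC, Clang and MSVC, and that `cimg_pragma_openmp(simd)` expands to `#pragma omp simd` when OpenMP is enabled. The function names and the `time_seconds` helper are hypothetical, not part of the library.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Variant 1: pointer-bumping loop, mirroring the original code.
template<typename T>
void copy_pointer_bump(T* dst, const double* src, unsigned int N, std::size_t whd) {
  for (unsigned int n = N; n > 0; --n) { *dst = (T)(*src++); dst += whd; }
}

// Variant 2: indexed loop with restrict-qualified pointers and a SIMD hint.
// The pragma is only emitted when the compiler has OpenMP enabled.
template<typename T>
void copy_indexed(T* __restrict dst, const double* __restrict src,
                  unsigned int N, std::size_t whd) {
#if defined(_OPENMP)
#pragma omp simd
#endif
  for (unsigned int n = 0; n < N; ++n) dst[n*whd] = (T)src[n];
}

// Hypothetical timing helper: call f() reps times, return elapsed seconds.
template<typename F>
double time_seconds(F&& f, int reps) {
  const auto t0 = std::chrono::steady_clock::now();
  for (int r = 0; r < reps; ++r) f();
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}
```

Both functions should produce identical buffers, so a fair comparison only needs to time them on realistic values of N and whd, for instance with `time_seconds([&]{ copy_indexed(dst, src, N, whd); }, reps)`. Whether the second variant wins probably depends on the compiler, the target instruction set and the size of N. The compiler's vectorization report (for example `-fopt-info-vec` with GCC or `-Rpass=loop-vectorize` with Clang) shows whether the loop was vectorized at all.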