Any thoughts on loop optimization?

I am looking for expert advice on a C/C++ loop optimization technique that, in theory, saves a little machine time. In practice, I don’t necessarily see much of a difference, but I admit that I haven’t done many comparison tests yet.

This happens in the fill(const char *const expression) function, when evaluating a math expression that returns a vector-valued result. In that case, for each (x,y,z) pixel of an image, the vector must be copied into the image, and I use a loop that looks like this:

const double *ptrs = res._data;
T *_ptrd = ptrd--; 
for (unsigned int n = N; n>0; --n) { *_ptrd = (T)(*ptrs++); _ptrd+=whd; }

(The vector values returned by the math evaluator are contiguous in memory, but the values that must be copied into the image are not: consecutive components are separated by an offset whd.)

I’ve tried replacing the loop above with:

const double *const cimg_restrict ptrs = res._data;
T *cimg_restrict _ptrd = ptrd--;
cimg_pragma_openmp(simd)
for (unsigned int n = 0; n<N; ++n) _ptrd[n*whd] = (T)ptrs[n];

Is there anyone in the audience who is familiar with vectorizing loops with SIMD, and who can tell me if the second version has any chance of being faster than the original version?
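For anyone who wants to experiment outside CImg, here is a minimal standalone sketch of the two variants. The function names (copy_strided_ptr, copy_strided_idx, strided_copies_agree) are hypothetical and not part of CImg or G'MIC, and the cimg_restrict qualifiers are left out to stay within standard C++. It only checks that both loops write the same values; it measures nothing. One general point to keep in mind: since the destination elements are whd apart, a vectorized version cannot use plain contiguous stores, which may limit what SIMD can gain on this kind of loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pointer-walking variant, mirroring the first loop: read contiguous
// doubles and write one value every `whd` elements of the destination.
template <typename T>
void copy_strided_ptr(const double *src, T *dst, std::size_t N, std::size_t whd) {
  for (std::size_t n = N; n > 0; --n) { *dst = (T)(*src++); dst += whd; }
}

// Indexed variant, mirroring the second loop. The pragma is only emitted
// when the compiler has OpenMP enabled; otherwise this is a plain loop.
template <typename T>
void copy_strided_idx(const double *src, T *dst, std::size_t N, std::size_t whd) {
#ifdef _OPENMP
#pragma omp simd
#endif
  for (std::size_t n = 0; n < N; ++n) dst[n * whd] = (T)src[n];
}

// Hypothetical helper: run both variants on identical input and report
// whether the two destination buffers end up identical.
inline bool strided_copies_agree(std::size_t N, std::size_t whd) {
  assert(N > 0 && whd > 0);
  std::vector<double> src(N);
  for (std::size_t n = 0; n < N; ++n) src[n] = 0.5 + (double)n;
  std::vector<float> a(N * whd, -1.0f), b(N * whd, -1.0f);
  copy_strided_ptr(src.data(), a.data(), N, whd);
  copy_strided_idx(src.data(), b.data(), N, whd);
  return a == b;
}
```

Timing these two functions with realistic N and whd values, with and without OpenMP enabled, would be one way to reproduce the comparison on a given compiler.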

I was thinking of supporting unsafe multithreading, but I don’t know if G’MIC eval/fill does that. I’m not an expert in C++.

OK so I’ve made a few heavy-computation tests.
The “optimized” version is in fact slower… :confused:
Interesting…

If there isn’t going to be a speedup, my suggestion is to allow users to define the number of threads to use for parallel operations. I have a separate command just for this, rep_mt_f32v_map basically, for cases where a middle ground between 1 and 12 threads is faster.
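To make the "user-defined thread count" idea concrete, here is a rough C++ sketch using OpenMP's num_threads clause. It is not how G'MIC or rep_mt_f32v_map works internally; fill_planar and its parameters are hypothetical names chosen for illustration. It assumes the per-pixel vectors are stored one after another in the source and that the image is planar, with components whd apart. Without OpenMP enabled, the pragma is skipped and the loop runs on a single thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy whd vectors of N components each into a planar
// image buffer, spreading the pixels over `nthreads` threads.
// vecs holds pixel 0's N components, then pixel 1's, and so on.
// img holds component 0 of every pixel, then component 1, and so on.
template <typename T>
void fill_planar(const std::vector<double> &vecs, std::vector<T> &img,
                 std::size_t whd, std::size_t N, int nthreads) {
  assert(nthreads >= 1);
  assert(vecs.size() == whd * N && img.size() == whd * N);
  const long count = (long)whd;
#ifdef _OPENMP
#pragma omp parallel for num_threads(nthreads)
#endif
  for (long p = 0; p < count; ++p) {
    const double *src = vecs.data() + (std::size_t)p * N;
    T *dst = img.data() + p;
    // Each pixel writes to its own set of destinations, so threads
    // never write to the same element.
    for (std::size_t n = 0; n < N; ++n) dst[n * whd] = (T)src[n];
  }
}
```

Calling it with nthreads set to 1, 4, 12 and so on, and timing each run, is one way to look for the middle ground on a given machine.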