I don’t know what you are missing there, but I want to give you a hint:
For better readability (which may also help you spot what you are missing), you can write the following, at least in GCC and clang:
__m128 r = _mm_setr_ps(...);
__m128 g = _mm_setr_ps(...);
__m128 b = _mm_setr_ps(...);
__m128 c = _mm_setr_ps(...);
__m128 result = (r + g + b) * c;
instead of writing it with SSE intrinsics, which would be:
__m128 r = _mm_setr_ps(...);
__m128 g = _mm_setr_ps(...);
__m128 b = _mm_setr_ps(...);
__m128 c = _mm_setr_ps(...);
__m128 result = _mm_mul_ps(c, _mm_add_ps(_mm_add_ps(r, g), b));
I almost always use this notation in RawTherapee code. It makes SSE code more readable and also gives the compiler room for further optimizations.
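If you want to convince yourself that both notations compute the same thing, here is a minimal self-contained sketch. The function names and the sample values are made up for illustration, and the operator version relies on the GCC/clang vector extension, so it is not expected to build with other compilers.

```c
#include <xmmintrin.h>  /* SSE intrinsics and the __m128 type */

/* Operator notation: relies on the GCC/clang vector extension. */
static inline __m128 blend_operators(__m128 r, __m128 g, __m128 b, __m128 c)
{
    return (r + g + b) * c;
}

/* The same computation written with explicit SSE intrinsics. */
static inline __m128 blend_intrinsics(__m128 r, __m128 g, __m128 b, __m128 c)
{
    return _mm_mul_ps(c, _mm_add_ps(_mm_add_ps(r, g), b));
}

/* Returns 1 if both versions agree in all four lanes for the sample
   inputs below (hypothetical values, chosen to be exact in float). */
int blend_versions_agree(void)
{
    __m128 r = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 g = _mm_setr_ps(0.5f, 0.5f, 0.5f, 0.5f);
    __m128 b = _mm_setr_ps(2.0f, 2.0f, 2.0f, 2.0f);
    __m128 c = _mm_setr_ps(0.25f, 0.25f, 0.25f, 0.25f);

    /* Lane-wise equality compare, then collect one bit per lane. */
    __m128 eq = _mm_cmpeq_ps(blend_operators(r, g, b, c),
                             blend_intrinsics(r, g, b, c));
    return _mm_movemask_ps(eq) == 0xF;
}
```

Calling blend_versions_agree() from your own main should return 1 when both notations produce identical lanes.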
@anon41087856 Glad to see that I’m not the only SSE coder here 