PhotoFlow optimizations and benchmarks

This is now fixed in the latest packages on GitHub…

Downloaded and checked, super :ok_hand:
Also a fancy "sidecar detected" dialog :santa::whale::peach:

True, loading of .CUBE LUTs is not fast, but it only happens the first time the LUT is accessed. After that it should go much faster…

Is there a LUT size that loads faster, or a more suitable format (.3dl, .mga, .look, .png, etc.)?

Could you provide me with an example .PFI file where this happens? I could not reproduce this problem myself.

I think I know what was going on: I loaded a couple of images and tried to mix them… it seems PhF always uses the background image (i.e. the first image loaded) as the BKG or bottom layer; am I right? Can I make BKG a "normal" layer? That is probably the logic behind gradients not working on the BKG layer, and behind the mix happening even with BKG momentarily on top. At least now I understand the "kid", hehe

I have finally started to work on the optimization of ICC conversions. The first and most obvious case is RGB -> RGB conversions from and to linear matrix profiles, with relative colorimetric intent and no black point compensation.
In this specific case, the conversion can be reduced to the product of a 3x3 matrix and a 3-element RGB vector.
For the moment I implemented this fast path in straight C code, without SSE optimizations (see here). Even with this simple code the gain is HUGE: on my test machine, with a single thread, the conversion of a Nikon D810 TIFF file in linear Rec.2020 colorspace into a linear sRGB JPEG goes from 10 s with LCMS2 to 900 ms with the fast path, i.e. more than 10x faster!
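
For reference, here is a minimal sketch of what this fast path boils down to (hypothetical names, not the actual PhotoFlow code; the real implementation is in the commit linked above):

// Minimal sketch of the scalar fast path: one 3x3 matrix multiply per pixel.
// 'm' is the combined input->output matrix; 'in'/'out' are interleaved
// RGBRGBRGB... float buffers holding n pixels. Hypothetical names.
void rgb2rgb_fast_path(const float m[3][3], const float* in, float* out, int n)
{
  for (int i = 0; i < n; i++) {
    const float r = in[0], g = in[1], b = in[2];
    out[0] = m[0][0]*r + m[0][1]*g + m[0][2]*b;
    out[1] = m[1][0]*r + m[1][1]*g + m[1][2]*b;
    out[2] = m[2][0]*r + m[2][1]*g + m[2][2]*b;
    in += 3; out += 3;
  }
}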

The code is committed to GitHub, and the most recent packages from today already take advantage of this enhancement.

I am not really sure if SSE2 optimizations would provide a large gain in this case, or whether they are worth the effort. I started to read some examples of SSE2 optimizations for the vector dot product, and got the feeling that explicit optimizations might even result in slower code than what the compiler generates… maybe @heckflosse has some good advice on this?

For this simple loop it's likely that the compiler generates good vectorized code. To check whether the compiler vectorizes a loop, you can add the compiler switch -ftree-vectorizer-verbose=2.

Edit: I may be wrong about the loop mentioned above, because the loop increment is 3, not 1 as one could assume when looking only at the loop header (which is what I did at first). Anyway, the verbose output will tell you whether the loop is vectorized.
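
As an illustration (my own toy example, not PhotoFlow code), a stride-1 loop that compilers typically auto-vectorize, together with the switch mentioned above:

// Toy example: a unit-stride loop that GCC can usually auto-vectorize.
// Compile with e.g.:  g++ -O3 -ftree-vectorizer-verbose=2 -c scale.cpp
// (newer GCC versions report the same information via -fopt-info-vec)
void scale(float* __restrict out, const float* __restrict in, float k, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = in[i] * k;  // stride 1: easy for the vectorizer
}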

I wrote some similar code to convert from RGBRGBRGBRGB to LLLLaaaabbbb here. Maybe you can use it as a base for your code. As you also need RGBRGBRGBRGB for the output, you have to add some shuffling.

While starting to understand how a dot product should be implemented with SIMD instructions, I stumbled on this Stack Overflow answer, which seems to be pertinent: https://stackoverflow.com/a/17019970

Also, there is a SIMD library developed in the context of a CERN project which looks interesting: GitHub - edanor/umevector: Vectorization EDSL library

Knowing how CERN works, and how much effort they put into high-quality computing, I expect it to be well written…

What do you think?

I think using RGBRGBRGBRGB as input and as output is not optimal for vectorizing. To get a good speedup, an input of RRRRGGGGBBBB is needed. To get an even better speedup, the output should have the same order: then you can just load four floats (using SIMD), process them and store them.

As this is not the case in your example, you need to shuffle the input values, then let SIMD do its magic, and then shuffle the values again for the output (to get back to RGBRGBRGBRGB).

Edit: Maybe there is a better way, but I really don't want to think about it. Just optimize the input and output for SIMD; then there is a good chance you won't even need to write SIMD code, because the vectorizer in the compiler already does that for you. A sketch of the layout conversion follows below.
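
To make the layout point concrete, here is a minimal scalar sketch (my own illustration, with hypothetical names) of deinterleaving RGBRGBRGB… into planar RRRR…GGGG…BBBB… buffers and back. With planar data, the per-channel processing loops become plain stride-1 loops that the auto-vectorizer handles well:

// Deinterleave RGBRGB... into three planar channel buffers.
void deinterleave_rgb(const float* in, float* r, float* g, float* b, int n)
{
  for (int i = 0; i < n; i++) {
    r[i] = in[3*i + 0];
    g[i] = in[3*i + 1];
    b[i] = in[3*i + 2];
  }
}

// Interleave three planar channel buffers back into RGBRGB...
void interleave_rgb(const float* r, const float* g, const float* b, float* out, int n)
{
  for (int i = 0; i < n; i++) {
    out[3*i + 0] = r[i];
    out[3*i + 1] = g[i];
    out[3*i + 2] = b[i];
  }
}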

Btw: my memory tells me that this could be done in roughly 50 ms (or less) on a five-year-old 8-core AMD FX-8350 for a 36-MPixel file, if the data is ordered to fit SIMD operations.

I am starting to learn SIMD and vectorization… could you explain what the optimal memory layout of the matrix and the RGB values would be for this matrix(3,3) x vector(3) product?

Thanks!

The figure I gave was for the whole conversion, including the JPEG encoding… I am still trying to benchmark the colorspace conversion alone.

Not for one matrix(3,3) x vector(3) product, but for many:

#ifdef __SSE2__
#include <emmintrin.h> // SSE2 intrinsics: __m128, _mm_set1_ps, _mm_loadu_ps, _mm_storeu_ps
#endif

void PF::ICCTransform::apply(float* redin, float* greenin, float* bluein, float* redout, float* greenout, float* blueout, int n)
{
  if( is_rgb2rgb ) {
#ifdef __SSE2__
    // Broadcast each of the nine matrix coefficients into all four SSE lanes,
    // so that four pixels can be processed per loop iteration.
    __m128 rgb2rgbv[3][3];
    for(int i = 0; i < 3; i++) {
      for(int j = 0; j < 3; j++) {
        rgb2rgbv[i][j] = _mm_set1_ps(rgb2rgb[i][j]);
      }
    }
#endif
    int i = 0;
#ifdef __SSE2__
    // Main SIMD loop, four pixels at a time. The planar layout (separate
    // red/green/blue buffers) makes every load and store unit-stride.
    // Note: a*b and a+b on __m128 values are GCC/Clang vector extensions.
    for(; i < n - 3; i += 4) {
        __m128 redv = _mm_loadu_ps(&redin[i]);
        __m128 greenv = _mm_loadu_ps(&greenin[i]);
        __m128 bluev = _mm_loadu_ps(&bluein[i]);
        _mm_storeu_ps(&redout[i], rgb2rgbv[0][0]*redv + rgb2rgbv[0][1]*greenv + rgb2rgbv[0][2]*bluev);
        _mm_storeu_ps(&greenout[i], rgb2rgbv[1][0]*redv + rgb2rgbv[1][1]*greenv + rgb2rgbv[1][2]*bluev);
        _mm_storeu_ps(&blueout[i], rgb2rgbv[2][0]*redv + rgb2rgbv[2][1]*greenv + rgb2rgbv[2][2]*bluev);
    }
#endif // __SSE2__
    // Scalar tail: the remaining pixels if n % 4 != 0
    // (or the whole buffer when SSE2 is not available).
    for(; i < n; i++) {
        redout[i] = rgb2rgb[0][0]*redin[i] + rgb2rgb[0][1]*greenin[i] + rgb2rgb[0][2]*bluein[i];
        greenout[i] = rgb2rgb[1][0]*redin[i] + rgb2rgb[1][1]*greenin[i] + rgb2rgb[1][2]*bluein[i];
        blueout[i] = rgb2rgb[2][0]*redin[i] + rgb2rgb[2][1]*greenin[i] + rgb2rgb[2][2]*bluein[i];
    }

    return;
  }
}
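
Just to show how the planar buffers are meant to be fed in, here is a hypothetical call-site fragment (the buffer names are mine, and the setup of the ICCTransform instance is PhotoFlow-internal and omitted):

#include <vector>

// Hypothetical usage fragment: n pixels in three planar float buffers.
// 'transform' is an already-configured PF::ICCTransform.
std::vector<float> r_in(n), g_in(n), b_in(n);    // filled from the source image
std::vector<float> r_out(n), g_out(n), b_out(n); // receives the converted pixels
transform.apply( r_in.data(), g_in.data(), b_in.data(),
                 r_out.data(), g_out.data(), b_out.data(), n );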

Andrea, as a general hint: don't use the _mm_mul_ps, _mm_add_ps and similar intrinsics. Use a * b or a + b instead, even for SIMD vectors (as I did in my example above). Compilers know how to translate that; it improves readability, and the resulting code is often faster than with explicit intrinsics.
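
To make the hint concrete, here are the two styles side by side (my own comparison, not from either codebase). Note that arithmetic operators on __m128 are a GCC/Clang vector extension; MSVC still requires the explicit intrinsics:

#include <emmintrin.h>

// Multiply-add written with explicit intrinsics...
static inline __m128 madd_intrinsics(__m128 a, __m128 b, __m128 c)
{
  return _mm_add_ps(_mm_mul_ps(a, b), c);
}

// ...and the same operation with plain operators, as suggested above.
// GCC and Clang accept arithmetic operators on vector types directly;
// both versions compile to the same mulps/addps instructions.
static inline __m128 madd_operators(__m128 a, __m128 b, __m128 c)
{
  return a * b + c;
}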

If it isn’t much work, stick OCIO transforms in. The GPU path for OCIO already has the V2 infrastructure being laid, compliments of Autodesk.

If you integrate OCIO, I can step you through or set up the matrix transforms required for the Nikon.
