PhotoFlow optimizations and benchmarks

I don’t know where the babl code is heading or what its ultimate use in GIMP might be. It’s being actively worked on, so it’s not in a final state, but here are some thoughts:

  • Equations for matrix to matrix conversions and conversions between XYZ/LAB are independent of ICC profile color management. So hopefully doing the math directly can be faster than using LCMS2 to do the math.
  • My understanding is that Pippin’s code does take into account making a neutral gray axis, which is something LCMS2 doesn’t do.
  • It seems to me (and keep in mind I haven’t spent a lot of time examining the code) that Pippin’s code can be used with RGB color spaces with various white points, without invoking iccMAX, which seems like it might be very useful for certain types of workflows.
  • GIMP commit 4cfeb53d095eff96e8f3bdfde319c06543e273c6 includes an observation that sending RGB values to the screen is noticeably faster with the new code, though this likely isn’t true for all types of monitor profiles, and black point compensation isn’t (yet?) implemented.
  • GIMP uses babl to do TRC conversions, so far limited to conversions between the sRGB TRC and linear gamma TRC. But hopefully in the future this will be generalized to other TRCs, and hopefully such conversions can be done much more efficiently than would be the case if they were done by invoking a full LCMS2 ICC profile transform.
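
As an illustration of the TRC conversions mentioned in the last point, here is a minimal sketch of the standard piecewise sRGB curve from IEC 61966-2-1 (the function names are mine, not babl’s):

```cpp
#include <cmath>

// Decode an sRGB-encoded value (0..1) to linear light,
// using the standard piecewise sRGB curve.
float srgb_to_linear(float v)
{
    return (v <= 0.04045f) ? v / 12.92f
                           : std::pow((v + 0.055f) / 1.055f, 2.4f);
}

// Encode a linear value (0..1) back to the sRGB TRC.
float linear_to_srgb(float v)
{
    return (v <= 0.0031308f) ? v * 12.92f
                             : 1.055f * std::pow(v, 1.0f / 2.4f) - 0.055f;
}
```

A dedicated function like this avoids the overhead of a full LCMS2 ICC transform when only the TRC changes.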

I don’t know what future coding plans might be, but this doesn’t seem very likely to me. The babl code focuses on RGB working spaces.

I’ve just chatted with pippin:

< pippin>: elle is providing a good response on that forum, if someone wants to chip in with a comment from pippin…
< pippin>: I do not want babl to depend on an open-core project like lcms2, whose performance is held hostage by the maintainer
< pippin>: using it for comparing accuracy/performance as well as for fallback for things not implemented, or not yet implemented is another matter

Thanks @houz! I guess I should look here and the functions being called from there?

Exactly. It might be useful to use SSE when applying the tonecurve, too, but we haven’t done so yet. You also want to look at commit_params() where some setup is done, mostly getting the matrix if possible.

1 Like

Here is the first update on the work on PhotoFlow optimization. This time I have included the most recent code for automatic Chromatic Aberrations corrections from RT, and enabled SSE2 optimizations.

Preamble: the improvements I am showing here are by no means the result of my own ideas. Instead, they come from the hard work done by @heckflosse and other RT developers! I have just taken the state-of-the-art RT code and plugged it into photoflow, with few modifications to adapt it to the photoflow processing pipeline.

Speaking of the processing model, a big difference between RT and PF is that RT bases its parallel processing on OpenMP, while PF processes image tiles in parallel using regular threads.

In the specific case of the CA correction, there is also another difference: in PF the analysis phase to derive the CA correction parameters is only performed once when the image is opened, while in RT it is AFAIK repeated each time the image is processed.

The benchmark is based as usual on an Ubuntu VM with 2 cores and 4GB of RAM, running on an OSX host with 4 cores and 8GB of memory.

Here are the results:

  • amsterdam.pef processed with PhotoFlow, Amaze demosaicing and Jpeg sRGB output:
    no CA correction: 1470 ms
    old CA correction: 1700 ms
    new CA correction, no SSE2: 1690 ms
    new CA correction, with SSE2: 1630 ms (difference with/without CA: 160 ms)

    The improvement is not dramatic, but still measurable and not zero.

  • amsterdam.pef processed with RawTherapee, Amaze demosaicing and Jpeg sRGB output:
    no CA correction: 1490 ms
    with CA correction: 1700 ms

RT and PF are very close here.

Differences become more prominent when processing bigger images like Nikon D810 RAWs:

  • D810 processed with PhotoFlow, Amaze demosaicing and Jpeg sRGB output:
    no CA correction: 4670 ms
    new CA correction, with SSE2: 5190 ms (difference with/without CA: 520 ms)

  • D810 processed with RawTherapee, Amaze demosaicing and Jpeg sRGB output:
    no CA correction: 5000 ms
    new CA correction, with SSE2: 5850 ms (difference with/without CA: 850 ms)

Since the code used in the two programs is basically the same, I assume that the differences come from the fact that RT is repeating the CA analysis during the processing phase…

A few more optimizations were introduced last week. This time I focused on the intermediate caching of image data, which is particularly important when filters with large input padding are involved (like large blurs or the “split details” module).

Now the code is able to automatically identify the intermediate buffers for which padding is needed, and it introduces in-memory tile caches to avoid re-computation of pixels.

For example, the time required to export the amsterdam.pef image to Jpeg with an additional “split details” layer and 5 levels goes from 14600 ms without caching to 9250 ms with caching enabled.
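
As an illustration of the idea (not PhotoFlow’s actual code), such a cache can be sketched as a compute-on-miss map keyed by tile coordinates:

```cpp
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Compute-on-miss tile cache: each tile's pixels are computed once and
// then served from memory, so filters with large input padding do not
// recompute the same region over and over.
class TileCache {
public:
    using Tile = std::vector<float>;
    using Key = std::pair<int, int>;  // tile column, tile row

    explicit TileCache(std::function<Tile(int, int)> compute)
        : compute_(std::move(compute)) {}

    const Tile& get(int tx, int ty) {
        auto it = cache_.find({tx, ty});
        if (it == cache_.end()) {
            // cache miss: compute the tile and store it
            it = cache_.emplace(Key{tx, ty}, compute_(tx, ty)).first;
        }
        return it->second;
    }

    size_t computed() const { return cache_.size(); }

private:
    std::function<Tile(int, int)> compute_;
    std::map<Key, Tile> cache_;
};
```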

The mechanism still requires some tweaking and further optimization, but the improvements are already non-negligible.

Updated packages for the stable and linear_gamma branches can be found here and here respectively.

1 Like

I’m liking the feel of the images processed in the linear version a lot… there are some annoyances, like layers that swap their places no matter what you do (the same behaviour used to happen with clone layers and paths), not being able to select the blend mode of group layers/bins, LUT loading taking forever (at least for .cube files), etc. But I’m sure those stalactites will be sanded down, and the important thing right now is to say: CONGRATS Andrea and Thank YOU :vulcan_salute:

 
Image developed in PhotoFlow unstable linear (amen) and finalised in gimp with G’mic and this LUT

3 Likes

This is now fixed in the latest packages on GitHub…

True, loading of .CUBE LUTs is not fast, but only happens once the first time the LUT is accessed. Then it should go much faster…

Could you provide me an example of .PFI file where this happens? I could not reproduce this problem myself.

Thanks!

1 Like

This is now fixed in the latest packages on GitHub…

Download and checked, super :ok_hand:
also fancy sidecar detected dialog :santa::whale::peach:

True, loading of .CUBE LUTs is not fast, but only happens once the first time the LUT is accessed. Then it should go much faster…

is there a size of LUT that loads faster, or any other more suitable format 3dl, mga, look, png, etc.?

Could you provide me an example of .PFI file where this happens? I could not reproduce this problem myself.

I think I know what was going on: I loaded a couple of images and tried to mix them… it seems PhF always uses the background image (say, the first image loaded) as the BKG or bottom layer; am I right, and can I make the BKG a “normal” layer? That’s probably the logic behind gradients not working on the BKG layer, and the mix happening even with a momentary BKG on top. At least now I understand the “kid”, hehe

I have finally started to work on the optimization of ICC conversions. The first and most obvious case is RGB -> RGB conversions from and to linear matrix profiles, with relative colorimetric intent and no black point compensation.
In this specific case, the conversion can be reduced to the product of a 3x3 matrix and a 3-element RGB vector.
For the moment I implemented this fast path in straight C code, without SSE optimizations (see here). Even with this simple code, the gain is HUGE: on my test machine and with one single thread, the conversion of a Nikon D810 TIFF file in linear Rec.2020 colorspace into a linear sRGB Jpeg goes from 10s with LCMS2 to 900ms with the fast path, i.e. more than 10x faster!
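
For reference, such a fast path boils down to one 3x3 matrix times 3-vector product per pixel; a minimal scalar sketch (names are mine, not the actual PhotoFlow code) looks like this:

```cpp
// Relative colorimetric conversion between two linear matrix RGB spaces
// reduces to one 3x3 matrix * 3-vector product per pixel:
//   out = M * in, where M combines the source-to-XYZ and
//   XYZ-to-destination matrices of the two profiles.
void rgb2rgb_scalar(const float M[3][3], const float* in, float* out, int npixels)
{
    for (int p = 0; p < npixels; p++, in += 3, out += 3) {
        // copy to temporaries so in-place conversion (in == out) also works
        const float r = in[0], g = in[1], b = in[2];
        out[0] = M[0][0]*r + M[0][1]*g + M[0][2]*b;
        out[1] = M[1][0]*r + M[1][1]*g + M[1][2]*b;
        out[2] = M[2][0]*r + M[2][1]*g + M[2][2]*b;
    }
}
```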

The code is committed to github, and the most recent packages from today already take advantage of this enhancement.

I am not really sure if SSE2 optimizations would provide a large gain in this case, and if they are worth the effort. I started to read some examples of SSE2 optimizations for the vector dot product, and got the feeling that explicit optimizations might even result in slower code than what is generated by the compiler… maybe @heckflosse has some good advice on this?

2 Likes

For this simple loop it’s likely that the compiler generates good vectorized code. To check whether the compiler vectorizes a loop, you can add the compiler switch -ftree-vectorizer-verbose=2.

Edit: I may be wrong about the loop mentioned above, because the loop increment is 3, not 1 as one could assume when only looking at the loop header (which I did at first). Anyway, the verbose output will tell you whether the loop is vectorized.

I wrote some similar code to convert from RGBRGBRGBRGB to LLLLaaaabbbb here. Maybe you can use it as a base for your code. As you need RGBRGBRGBRGB for the output as well, you have to add some shuffling.

1 Like

While starting to understand how a dot product should be implemented with SIMD instructions, I stumbled upon this Stack Overflow answer, which seems pertinent: https://stackoverflow.com/a/17019970

Also, there is a SIMD library developed in the context of a CERN project, and which looks interesting: GitHub - edanor/umevector: Vectorization EDSL library

Knowing how CERN works, and how much effort they put into high-quality computing, I expect it to be well-written…

What do you think?

I think using RGBRGBRGBRGB as input and as output is not optimal for vectorizing. To get a good speedup, an input of RRRRGGGGBBBB is needed. To get an even better speedup, the output should have the same order. Then you can just load four floats (using SIMD), process them and store them.

As this is not the case in your example, you need to shuffle the input values, then let SIMD do its magic and then shuffle the values again for output (to again get RGBRGBRGBRGB).

Edit: Maybe there is a better way, but I really don’t want to think about it. Just optimize the input and output for SIMD; then there is a good chance you won’t even need to write SIMD code, because the vectorizer in the compiler already does that for you.
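
A scalar sketch of the shuffling step described above (the names are mine; once the data is planar, the per-plane math loop is exactly the kind the compiler can auto-vectorize):

```cpp
// De-interleave RGBRGBRGB... into planar RRRR..GGGG..BBBB so that a
// plain loop over each plane becomes trivially vectorizable.
void deinterleave(const float* rgb, float* r, float* g, float* b, int npixels)
{
    for (int i = 0; i < npixels; i++) {
        r[i] = rgb[3*i + 0];
        g[i] = rgb[3*i + 1];
        b[i] = rgb[3*i + 2];
    }
}

// Re-interleave the planar result back to RGBRGBRGB... for output.
void interleave(const float* r, const float* g, const float* b, float* rgb, int npixels)
{
    for (int i = 0; i < npixels; i++) {
        rgb[3*i + 0] = r[i];
        rgb[3*i + 1] = g[i];
        rgb[3*i + 2] = b[i];
    }
}
```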

1 Like

Btw: my memory tells me that this could be done in roughly 50 ms (or less) using a 5 year-old 8-core AMD FX8350 for a 36 MPixel file if the data is ordered to fit SIMD operations.

I am starting to learn SIMD and vectorization… could you explain what would be the optimal memory layout of the matrix and RGB values for this matrix(3,3) x vector(3) product?

Thanks!

The figure I gave was for the whole conversion, including the JPEG encoding… I am still trying to benchmark the colorspace conversion alone.

Not for one matrix(3,3) x vector(3) product, but for many:

void PF::ICCTransform::apply(float* redin, float* greenin, float* bluein, float* redout, float* greenout, float* blueout, int n)
{
  if( is_rgb2rgb ) {
#ifdef __SSE2__
    // broadcast each matrix coefficient into all four lanes of an SSE register
    __m128 rgb2rgbv[3][3];
    for(int i = 0; i < 3; i++) {
      for(int j = 0; j < 3; j++) {
        rgb2rgbv[i][j] = _mm_set1_ps(rgb2rgb[i][j]);
      }
    }
#endif
    int i = 0;
#ifdef __SSE2__
    // process four pixels per iteration
    for(; i < n - 3; i += 4) {
      __m128 redv = _mm_loadu_ps(&redin[i]);
      __m128 greenv = _mm_loadu_ps(&greenin[i]);
      __m128 bluev = _mm_loadu_ps(&bluein[i]);
      _mm_storeu_ps(&redout[i], rgb2rgbv[0][0]*redv + rgb2rgbv[0][1]*greenv + rgb2rgbv[0][2]*bluev);
      _mm_storeu_ps(&greenout[i], rgb2rgbv[1][0]*redv + rgb2rgbv[1][1]*greenv + rgb2rgbv[1][2]*bluev);
      _mm_storeu_ps(&blueout[i], rgb2rgbv[2][0]*redv + rgb2rgbv[2][1]*greenv + rgb2rgbv[2][2]*bluev);
    }
#endif // __SSE2__
    for(; i < n; i++) { // remaining pixels if n % 4 != 0
      redout[i] = rgb2rgb[0][0]*redin[i] + rgb2rgb[0][1]*greenin[i] + rgb2rgb[0][2]*bluein[i];
      greenout[i] = rgb2rgb[1][0]*redin[i] + rgb2rgb[1][1]*greenin[i] + rgb2rgb[1][2]*bluein[i];
      blueout[i] = rgb2rgb[2][0]*redin[i] + rgb2rgb[2][1]*greenin[i] + rgb2rgb[2][2]*bluein[i];
    }
  }
}
1 Like

Andrea, as a general hint: don’t use the _mm_mul_ps, _mm_add_ps etc. intrinsics. Use a * b or a + b instead, even for SIMD vectors (as I did in my example above). Compilers know how to translate that. It improves readability, and often the code is faster than with the intrinsics.
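
A minimal example of that style, assuming GCC or Clang on an SSE2-capable x86 target (MSVC does not support arithmetic operators on `__m128`, so there you would still need the intrinsics):

```cpp
#include <emmintrin.h>  // SSE2

// With GCC and Clang, __m128 supports the usual arithmetic operators,
// so a*b + c compiles to the same mulps/addps sequence as the intrinsic
// form while reading like scalar code.
__m128 mul_add(__m128 a, __m128 b, __m128 c)
{
    return a * b + c;   // instead of _mm_add_ps(_mm_mul_ps(a, b), c)
}
```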

If it isn’t much work, stick OCIO transforms in. The GPU path for OCIO already has the V2 infrastructure being laid, compliments of Autodesk.

If you integrate OCIO, I can step you through or set up the matrix transforms required for the Nikon.

3 Likes