PhotoFlow optimizations and benchmarks

Ok, that’s the current version :+1:

1 Like

Awesome! Part of why I report bugs and usage challenges for PF is that my system is very slow; a cut from a dull, slow knife is more painful, so to speak. The more things are optimized, the less the minor issues get in the way. Usually I need to wait for PF to finish processing to a certain extent before I do anything else; otherwise I risk it becoming unstable.

1 Like

https://twitter.com/GIMP_Official/status/910798211056586752

Might be worth pinging Oyvind (pippin)?

I agree that the LCMS2 conversions are really slow and a bottleneck in RT processing too.

But another optimization really worth integrating into PhotoFlow would be the latest code for raw CA correction :wink:

1 Like

Well, until recently the BABL/GEGL code has been very sRGB-centric, so not a good option… let’s see how it evolves.

That’s not a problem. It’s fun :wink:

1 Like

Well, until recently the BABL/GEGL code has been very sRGB-centric, so not a good option… let’s see how it evolves.

Pippin is working on babl code (not sRGB-centric) that handles RGB color space transforms without using LCMS2:

Thanks for the heads-up, Elle! Looking quickly through the code, I do not see many optimizations. Instead, it seems to implement an alternative CMS which does not depend at all on LCMS2.

For me this seems to be overkill… what I actually need with high priority is an optimized, fast implementation of ICC conversion between matrix profiles in relative colorimetric intent, as well as RGB ↔ Lab conversions, which cover 99% of the conversions commonly performed during photo editing.

LCMS2 already provides the infrastructure for reading or creating ICC profiles, and for retrieving the useful information (type of profile, colorants, TRCs, etc…) from them, and I am not planning to move away from it.
Moreover, the LCMS2 machinery is still very handy for more complex conversions (LUT profiles, non-relative rendering intents, partial adaptation, abstract profiles, etc…).

I have great respect for Pippin, but in this case my impression is that he is re-inventing the wheel… are there plans to completely drop GIMP’s dependency on LCMS2? If so, why?

1 Like

darktable has such code; in git it’s even SSE-enabled. You might ping me on IRC so I can point you to the places where the code lives.

I don’t know where the babl code is heading or what its ultimate use in GIMP might be. It’s being actively worked on, so not in a final state, but here are some thoughts:

  • Equations for matrix-to-matrix conversions and conversions between XYZ/LAB are independent of ICC profile color management. So hopefully doing the math directly can be faster than routing it through LCMS2 (a rough sketch of that math follows after this list).
  • My understanding is that Pippin’s code does take into account making a neutral gray axis, which is something LCMS2 doesn’t do.
  • It seems to me (and keep in mind I haven’t spent a lot of time examining the code) that Pippin’s code can be used with RGB color spaces with various white points, without invoking iccMAX, which seems like it might be very useful for certain types of workflows.
  • GIMP commit 4cfeb53d095eff96e8f3bdfde319c06543e273c6 contains an observation that sending RGB values to the screen is noticeably faster with the new code, though this likely isn’t true for all types of monitor profiles, and black point compensation isn’t (yet?) implemented.
  • GIMP uses babl to do TRC conversions, so far limited to conversions between the sRGB TRC and linear gamma TRC. But hopefully in the future this will be generalized to other TRCs, and hopefully such conversions can be done much more efficiently than would be the case if they were done by invoking a full LCMS2 ICC profile transform.
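
To make the first and last bullets a bit more concrete, here is a rough sketch of the kind of math involved: the well-known sRGB TRC linearization and the XYZ to Lab conversion relative to the D50 white point of the ICC PCS. This is not babl’s or GIMP’s actual code; the function names and layout are mine, for illustration only.

```cpp
#include <cmath>

// Standard sRGB TRC: encoded value -> linear light (well-known formula).
float srgb_to_linear(float c) {
  return (c <= 0.04045f) ? c / 12.92f
                         : std::pow((c + 0.055f) / 1.055f, 2.4f);
}

// XYZ -> Lab relative to the D50 white point used by the ICC PCS.
// Illustrative only; real code would precompute constants and vectorize.
void xyz_to_lab(float X, float Y, float Z, float& L, float& a, float& b) {
  const float Xn = 0.9642f, Yn = 1.0f, Zn = 0.8249f;  // D50 reference white
  auto f = [](float t) {
    const float eps   = 216.0f / 24389.0f;            // (6/29)^3
    const float kappa = 24389.0f / 27.0f;
    return (t > eps) ? std::cbrt(t) : (kappa * t + 16.0f) / 116.0f;
  };
  const float fx = f(X / Xn), fy = f(Y / Yn), fz = f(Z / Zn);
  L = 116.0f * fy - 16.0f;
  a = 500.0f * (fx - fy);
  b = 200.0f * (fy - fz);
}
```

Done inline like this, per pixel, there is no profile lookup or LCMS2 transform object involved, which is where the hoped-for speed-up comes from.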

I don’t know what future coding plans might be, but this doesn’t seem very likely to me. The babl code focuses on RGB working spaces.

I’ve just chatted with pippin:

< pippin>: elle is providing a good response on that forum, if someone wants to chip in with a comment from pippin…
< pippin>: I do not want babl to depend on an open-core project like lcms2, whose performance is held hostage by the maintainer
< pippin>: using it for comparing accuracy/performance as well as for fallback for things not implemented, or not yet implemented is another matter

Thanks @houz! I guess I should look here and at the functions called from there?

Exactly. It might be useful to use SSE when applying the tonecurve, too, but we haven’t done so yet. You also want to look at commit_params() where some setup is done, mostly getting the matrix if possible.

1 Like

Here is the first update on the work on PhotoFlow optimization. This time I have included the most recent code for automatic chromatic aberration (CA) correction from RT, and enabled SSE2 optimizations.

Preamble: the improvements I am showing here are by no means the result of my own ideas. Instead, they come from the hard work done by @heckflosse and other RT developers! I have just taken the state-of-the-art RT code and plugged it into PhotoFlow, with a few modifications to adapt it to the PhotoFlow processing pipeline.

Talking about the processing model, a big difference between RT and PF is that RT bases its parallel processing on OpenMP, while PF processes image tiles in parallel using normal threads.
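
For readers not familiar with the two models, here is a toy sketch (hypothetical names, not the actual RT or PF code): RT-style code typically parallelizes a loop over rows or pixel blocks with an OpenMP pragma, while the PF-style approach hands whole tiles to a pool of ordinary threads.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

struct Tile { int x, y, w, h; };

// Stand-in for the real per-tile work (demosaic, CA correction, ...).
void process_tile(const Tile& /*tile*/) { /* ... */ }

// PF-style: a pool of normal threads pulls tiles from a shared atomic counter.
void process_tiles(const std::vector<Tile>& tiles, unsigned n_threads) {
  std::atomic<std::size_t> next{0};
  std::vector<std::thread> pool;
  for (unsigned i = 0; i < n_threads; ++i)
    pool.emplace_back([&] {
      for (std::size_t j = next++; j < tiles.size(); j = next++)
        process_tile(tiles[j]);
    });
  for (auto& t : pool) t.join();
}

// RT-style (for comparison): OpenMP splits a row loop across cores.
//   #pragma omp parallel for schedule(dynamic)
//   for (int row = 0; row < height; ++row)
//     process_row(row);
```

Both distribute work across cores; the main practical difference is whether the parallelism lives inside each algorithm (OpenMP) or in the pipeline that feeds it tiles (threads).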

In the specific case of the CA correction, there is also another difference: in PF the analysis phase to derive the CA correction parameters is only performed once when the image is opened, while in RT it is AFAIK repeated each time the image is processed.

The benchmark is based, as usual, on an Ubuntu VM with 2 cores and 4GB of RAM, running on an OSX host with 4 cores and 8GB of memory.

Here are the results:

  • amsterdam.pef processed with PhotoFlow, Amaze demosaicing and Jpeg sRGB output:
    no CA correction: 1470 ms
    old CA correction: 1700 ms
    new CA correction, no SSE2: 1690 ms
    new CA correction, with SSE2: 1630 ms (difference with/without CA: 160 ms)

    The improvement is not dramatic, but still measurable and not zero.

  • amsterdam.pef processed with RawTherapee, Amaze demosaicing and Jpeg sRGB output:
    no CA correction: 1490 ms
    with CA correction: 1700 ms

RT and PF are very close here.

Differences become more prominent when processing bigger images like Nikon D810 RAWs:

  • D810 processed with PhotoFlow, Amaze demosaicing and Jpeg sRGB output:
    no CA correction: 4670 ms
    new CA correction, with SSE2: 5190 ms (difference with/without CA: 520 ms)

  • D810 processed with RawTherapee, Amaze demosaicing and Jpeg sRGB output:
    no CA correction: 5000 ms
    new CA correction, with SSE2: 5850 ms (difference with/without CA: 850 ms)

Since the code used in the two programs is basically the same, I assume that the differences come from the fact that RT is repeating the CA analysis during the processing phase…

A few more optimizations were introduced last week. This time I focused on the intermediate caching of image data, which is particularly important when filters with large input padding are involved (like large blurs or the “split details” module).

Now the code is able to automatically identify the intermediate buffers for which padding is needed, and it introduces in-memory tile caches to avoid re-computation of pixels.
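
As a very rough and hypothetical sketch of the mechanism (class and type names are mine, not PhotoFlow’s): already-computed tiles are kept in an in-memory cache keyed by their position in the tile grid, so when a padded filter requests overlapping regions the pixels are looked up instead of recomputed.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical sketch, not PhotoFlow's actual classes.
using TileKey  = std::pair<int, int>;                  // tile column/row in the grid
using TileData = std::shared_ptr<std::vector<float>>;  // cached pixel buffer

class TileCache {
public:
  // Return the cached tile if present; otherwise compute it once and keep it.
  template <typename ComputeFn>
  TileData get(const TileKey& key, ComputeFn compute) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = tiles_.find(key);
    if (it != tiles_.end())
      return it->second;             // hit: the padded region is not recomputed
    TileData data = compute(key);    // miss: compute the tile (including padding)
    tiles_.emplace(key, data);
    return data;
  }

private:
  std::mutex mutex_;
  std::map<TileKey, TileData> tiles_;  // a real cache would also bound its memory use
};
```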

For example, the time required to export the amsterdam.pef image to Jpeg with an additional “split details” layer and 5 levels goes from 14600 ms without caching to 9250 ms with caching enabled.

The mechanism still requires some tweaking and further optimization, but the improvements are already non-negligible.

Updated packages for the stable and linear_gamma branches can be found here and here respectively.

1 Like

I’m really liking the feel of the images processed in the linear version… there are some annoyances, like layers that swap their places no matter what you do (the same behaviour used to happen with clone layers and paths), not being able to select the blend mode of group layers/bins, LUT loading taking forever (at least for .cube files), etc. But I’m sure those rough edges will be sanded down, and the important thing right now is to say: CONGRATS Andrea and thank YOU :vulcan_salute:

 
Image developed in PhotoFlow unstable linear (amen) and finalised in GIMP with G’MIC and this LUT

3 Likes

This is now fixed in the latest packages on GitHub…

True, loading of .CUBE LUTs is not fast, but only happens once the first time the LUT is accessed. Then it should go much faster…

Could you provide me an example of .PFI file where this happens? I could not reproduce this problem myself.

Thanks!

1 Like

This is now fixed in the latest packages on GitHub…

Downloaded and checked, super :ok_hand:
Also a fancy “sidecar detected” dialog :santa::whale::peach:

True, loading of .CUBE LUTs is not fast, but only happens once the first time the LUT is accessed. Then it should go much faster…

Is there a LUT size that loads faster, or another more suitable format (3dl, mga, look, png, etc.)?

Could you provide me an example of .PFI file where this happens? I could not reproduce this problem myself.

I think I know what was going on: I loaded a couple of images and tried to mix them… it seems PhF always uses the background image (say, the first image loaded) as the BKG or bottom layer; am I right, and can I make BKG a “normal” layer? That is probably the logic behind gradients not working on the BKG layer, and the mix happening even with the BKG momentarily on top. At least now I understand the “kid”, hehe

I have finally started to work on the optimization of ICC conversions. The first and most obvious case is RGB -> RGB conversions from and to linear matrix profiles, with relative colorimetric intent and no black point compensation.
In this specific case, the conversion reduces to the product of a 3x3 matrix and a 3-element RGB vector.
For the moment I have implemented this fast path in straight C code, without SSE optimizations (see here). Even with this simple code, the gain is HUGE: on my test machine and with a single thread, the conversion of a Nikon D810 TIFF file in the linear Rec.2020 colorspace into a linear sRGB Jpeg goes from 10s with LCMS2 to 900ms with the fast path, i.e. more than 10x faster!
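
For illustration, a minimal sketch of such a fast path (hypothetical names, not the committed PhotoFlow code): the source and destination matrix profiles are combined once into a single 3x3 matrix, which is then applied to every pixel.

```cpp
#include <cstddef>

// Apply a precomputed 3x3 conversion matrix M to an interleaved RGB buffer.
// M would be built once per conversion, e.g. as
// (destination RGB->XYZ matrix)^-1 * (source RGB->XYZ matrix).
void rgb_to_rgb_fast(const float* in, float* out, std::size_t n_pixels,
                     const float M[3][3]) {
  for (std::size_t i = 0; i < n_pixels; ++i, in += 3, out += 3) {
    const float r = in[0], g = in[1], b = in[2];
    out[0] = M[0][0] * r + M[0][1] * g + M[0][2] * b;
    out[1] = M[1][0] * r + M[1][1] * g + M[1][2] * b;
    out[2] = M[2][0] * r + M[2][1] * g + M[2][2] * b;
  }
}
```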

The code is committed to github, and the most recent packages from today already take advantage of this enhancement.

I am not really sure whether SSE2 optimizations would provide a large gain in this case, or whether they are worth the effort. I started to read some examples of SSE2 optimizations for the vector dot product, and got the feeling that explicit optimizations might even result in slower code than what the compiler generates… maybe @heckflosse has some good advice on this?
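
For reference, an explicit SSE2 version of the per-pixel product could look roughly like the sketch below. This is my own untested sketch with a made-up data layout, and, as noted above, it may well be no faster than what the compiler auto-vectorizes from the scalar loop.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Horizontal sum of the four lanes of v, using only SSE/SSE2 shuffles.
static inline float hsum_ps(__m128 v) {
  __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); // [v1, v0, v3, v2]
  __m128 sums = _mm_add_ps(v, shuf);                           // pairwise sums
  shuf = _mm_movehl_ps(shuf, sums);                            // high pair into low lanes
  sums = _mm_add_ss(sums, shuf);                               // total in lane 0
  return _mm_cvtss_f32(sums);
}

// One pixel through the 3x3 matrix; rows are pre-loaded as __m128 with the
// fourth lane set to zero (hypothetical data layout, for illustration only).
static inline void convert_pixel_sse2(const __m128 row[3],
                                      const float* in, float* out) {
  const __m128 px = _mm_set_ps(0.0f, in[2], in[1], in[0]);     // lanes [r, g, b, 0]
  for (int c = 0; c < 3; ++c)
    out[c] = hsum_ps(_mm_mul_ps(row[c], px));
}
```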

2 Likes

For this simple loop it’s likely that the compiler generates good vectorized code. To check whether the compiler vectorizes a loop, you can add the compiler switch -ftree-vectorizer-verbose=2.

Edit: I may be wrong about the loop mentioned above, because the loop increment is 3, not 1 as one could assume when only looking at the loop header (which I did first). Anyway, the verbose output will tell you whether the loop is vectorized.