PhotoFlow optimizations and benchmarks


(Carmelo Dr Raw) #1

After fixing quite a few bugs and making the code more stable, I have started looking into speed improvements, and I’ve set up some simple benchmarks.

The first item I worked on is the Amaze demosaicing, which is derived from the corresponding RawTherapee code. Until now the SSE2 optimizations were disabled, so I went ahead and enabled the SSE2 code for Amaze. This gives a nice 2x speed improvement for the demosaicing phase, which is really quite cool!
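
To give a rough idea of where the 2x comes from (this is only a generic illustration, not the actual Amaze code): SSE2 lets the inner loops work on four floats per instruction, so the per-pixel arithmetic that dominates demosaicing can be vectorized along the rows.

#include <emmintrin.h>  // SSE2 intrinsics

// Generic illustration: out[i] = 0.5f * (a[i] + b[i]), four floats at a time.
// This is the kind of per-pixel arithmetic that dominates demosaicing.
// n is assumed to be a multiple of 4; the buffers need not be aligned.
void average_rows_sse2(const float* a, const float* b, float* out, int n)
{
    const __m128 half = _mm_set1_ps(0.5f);
    for (int i = 0; i < n; i += 4) {
        __m128 va  = _mm_loadu_ps(a + i);   // load 4 floats from each row
        __m128 vb  = _mm_loadu_ps(b + i);
        __m128 avg = _mm_mul_ps(_mm_add_ps(va, vb), half);
        _mm_storeu_ps(out + i, avg);        // store 4 results at once
    }
}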

I’m running my benchmarks on a VirtualBox machine with an Ubuntu 17.04 guest and two cores.
PhotoFlow is compiled with the following flags:

-std=gnu++11 -march=nocona -mno-sse3 -mtune=generic -g -O3

My reference for comparison is RawTherapee 5.2, compiled from sources on the same virtual machine with the following configuration:

Version: 5.2-190-gf0acd239
Branch: dev
Commit: f0acd239
Commit date: 2017-09-21
Compiler: cc 6.3.0
Processor: generic x86
System: Linux
Bit depth: 64 bits
Gtkmm: V3.22.0
Lensfun: V0.3.2.0
Build type: Release
Build flags:  -std=c++11 -mtune=generic -Werror=unused-label -fopenmp -Werror=unknown-pragmas -Wall -Wno-unused-result -Wno-deprecated-declarations -O3 -DNDEBUG
Link flags:  -mtune=generic
OpenMP support: ON
MMAP support: ON

Saving a Nikon D810 RAW file to Jpeg using the standard camera color matrix gives the following figures:

  • PhF, Amaze without SSE2: 6300 ms
  • PhF, Amaze with SSE2: 4800 ms
  • RT with neutral profile: 5300 ms

When the ICC conversion from the camera colorspace to sRGB is disabled, the time for saving a D810 RAW to Jpeg goes down to 2300 ms.

This already indicates that the next item worth optimizing is the ICC conversions, which are currently handled entirely by LCMS2.

For comparison, saving the same D810 RAW file to Jpeg with the neutral profile in RT takes about 5300 ms, so slightly more than PhF. However, this is not a completely fair comparison, because RT goes through a chain of colorspace conversions (camera -> working RGB -> Lab -> output RGB) even when the neutral profile is used, while PhF only does camera -> output RGB.
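
To make clear what this conversion amounts to: PhF essentially builds one LCMS2 transform from the camera matrix profile to the output profile and runs every pixel through it. A minimal sketch of that kind of call (not the actual PhotoFlow code; the profile file names are placeholders) looks roughly like this:

#include <lcms2.h>
#include <vector>

// Minimal sketch (not the actual PhotoFlow code): a single relative
// colorimetric transform from the camera profile to the output profile,
// created once and applied to all pixels. File names are placeholders.
void convert_camera_to_output(const std::vector<float>& in_rgb,
                              std::vector<float>& out_rgb)
{
    cmsHPROFILE cam = cmsOpenProfileFromFile("camera.icc", "r");
    cmsHPROFILE out = cmsOpenProfileFromFile("srgb.icc", "r");
    cmsHTRANSFORM xform = cmsCreateTransform(cam, TYPE_RGB_FLT,
                                             out, TYPE_RGB_FLT,
                                             INTENT_RELATIVE_COLORIMETRIC, 0);
    out_rgb.resize(in_rgb.size());
    // Most of the time difference measured above is spent in this call:
    cmsDoTransform(xform, in_rgb.data(), out_rgb.data(),
                   static_cast<cmsUInt32Number>(in_rgb.size() / 3));
    cmsDeleteTransform(xform);
    cmsCloseProfile(cam);
    cmsCloseProfile(out);
}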


This is just the tip of the iceberg, as many tools in PhF are badly in need of SSE and other code optimizations. I will post progress and benchmarks here whenever I have some nice new results… meanwhile, the optimized Amaze code will be committed in the next few days, after a few more checks.


(Ingo Weyrich) #2

Which Amaze code did you use for your test in PhotoFlow? The one from your stable branch or the one from the RT dev branch?


(Carmelo Dr Raw) #3

I have taken this version from the dev branch.


(Ingo Weyrich) #4

Ok, that’s the current version :+1:


#5

Awesome! Part of why I report bugs and usage challenges for PF is that my system is very slow. A dull and slow knife cuts more painfully, so to speak. The more things are optimized, the less of a hindrance the minor issues will be. Usually I need to wait for PF to finish processing to a certain extent before I do anything else; otherwise I risk it becoming unstable.


(Pat David) #6

Might be worth pinging Oyvind (pippin)?


(Ingo Weyrich) #7

I agree that the LCMS2 conversions are really slow, and that they are a bottleneck in RT processing too.

But another optimization really worth integrating into PhotoFlow would be the latest code for raw CA correction :wink:


(Carmelo Dr Raw) #8

Well, until recently the BABL/GEGL code has been very sRGB-centric, so not a good option… let’s see how it evolves.


(Ingo Weyrich) #9

That’s not a problem. It’s fun :wink:


(Elle Stone) #10

Well, until recently the BABL/GEGL code has been very sRGB-centric, so not a good option… let’s see how it evolves.

Pippin is working on babl code (not sRGB-centric) that handles RGB color space transforms without using LCMS2:

https://git.gnome.org/browse/babl/tree/babl/babl-icc.c
https://git.gnome.org/browse/babl/tree/babl/babl-trc.c
https://git.gnome.org/browse/babl/tree/babl/babl-space.c


(Carmelo Dr Raw) #11

Thanks for the heads-up, Elle! Looking quickly through the code, I do not see many optimizations. Instead, it seems to implement an alternative CMS which does not depend on LCMS2 at all.

To me this seems like overkill… what I actually need with high priority is an optimized, fast implementation of ICC conversions between matrix profiles with relative colorimetric intent, as well as RGB <-> Lab conversions, which cover 99% of the conversions commonly performed during photo editing.

LCMS2 already provides the infrastructure for reading or creating ICC profiles, and for retrieving the useful information (type of profile, colorants, TRCs, etc…) from them, and I am not planning to move away from it.
Moreover, the LCMS2 machinery is still very handy for more complex conversions (LUT profiles, non-relative rendering intents, partial adaptation, abstract profiles, etc…).
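
To make the fast path concrete: a conversion between two matrix profiles reduces to decoding the input TRC, applying a single combined 3x3 matrix, and encoding the output TRC. A rough sketch (not actual PhotoFlow code, assuming simple power-law TRCs and no black point compensation):

#include <cmath>

// Rough sketch of a direct matrix-profile conversion: decode the input TRC,
// apply one combined RGB->RGB 3x3 matrix, encode the output TRC.
// M is assumed to be input_RGB->XYZ multiplied by XYZ->output_RGB, both
// derived from the profile colorants; power-law TRCs are an assumption.
struct Mat3 { float m[3][3]; };

void convert_matrix_profiles(const float in[3], float out[3],
                             const Mat3& M, float gamma_in, float gamma_out)
{
    float lin[3];
    for (int i = 0; i < 3; ++i)                    // input TRC -> linear
        lin[i] = std::pow(in[i], gamma_in);

    for (int i = 0; i < 3; ++i) {                  // combined 3x3 matrix
        float v = M.m[i][0] * lin[0] + M.m[i][1] * lin[1] + M.m[i][2] * lin[2];
        out[i] = std::pow(v < 0.f ? 0.f : v, 1.f / gamma_out);  // output TRC
    }
}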

I have great respect for Pippin, but in this case my impression is that he is re-inventing the wheel… are there plans to completely drop GIMP’s dependency on LCMS2? If so, why?


#12

darktable has such code; in git it’s even SSE-enabled. You might ping me on IRC so I can point you to the places where to find it.


(Elle Stone) #13

I don’t know where the babl code is heading or what its ultimate use in GIMP might be. It’s being actively worked on, so not in a final state, but here are some thoughts:

  • Equations for matrix-to-matrix conversions and for conversions between XYZ and Lab are independent of ICC profile color management, so hopefully doing the math directly can be faster than using LCMS2 to do it (see the sketch after this list).
  • My understanding is that Pippin’s code does take into account making a neutral gray axis, which is something LCMS2 doesn’t do.
  • It seems to me (and keep in mind I haven’t spent a lot of time examining the code) that Pippin’s code can be used with RGB color spaces with various white points, without invoking iccMAX, which seems like it might be very useful for certain types of workflows.
  • GIMP commit 4cfeb53d095eff96e8f3bdfde319c06543e273c6 contains an observation that sending RGB values to the screen is noticeably faster with the new code, though this likely isn’t true for all types of monitor profiles, and black point compensation isn’t (yet?) implemented.
  • GIMP uses babl to do TRC conversions, so far limited to conversions between the sRGB TRC and the linear gamma TRC. Hopefully in the future this will be generalized to other TRCs, and hopefully such conversions can be done much more efficiently than by invoking a full LCMS2 ICC profile transform.
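
As an example of the first point, here is the standard XYZ -> Lab mapping written out directly (a plain sketch, with a D50 reference white assumed); it is just a cube root and a few multiplications per channel, with no ICC machinery involved:

#include <cmath>

// Standard CIE XYZ -> Lab conversion; the D50 white point values below are
// an assumption. This is the kind of math that can be computed directly
// instead of going through a full ICC profile transform.
void xyz_to_lab(float X, float Y, float Z, float& L, float& a, float& b)
{
    const float Xn = 0.9642f, Yn = 1.0f, Zn = 0.8249f;   // D50 reference white
    auto f = [](float t) {
        const float eps   = 216.f / 24389.f;             // (6/29)^3
        const float kappa = 24389.f / 27.f;
        return t > eps ? std::cbrt(t) : (kappa * t + 16.f) / 116.f;
    };
    float fx = f(X / Xn), fy = f(Y / Yn), fz = f(Z / Zn);
    L = 116.f * fy - 16.f;
    a = 500.f * (fx - fy);
    b = 200.f * (fy - fz);
}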

I don’t know what the future coding plans might be, but completely dropping LCMS2 from GIMP doesn’t seem very likely to me. The babl code focuses on RGB working spaces.


(Tobias) #14

I’ve just chatted with pippin:

< pippin>: elle is providing a good response on that forum, if someone wants to chip in with a comment from pippin…
< pippin>: I do not want babl to depend on an open-core project like lcms2, whose performance is held hostage by the maintainer
< pippin>: using it for comparing accuracy/performance as well as for fallback for things not implemented, or not yet implemented is another matter


(Carmelo Dr Raw) #15

Thanks @houz! I guess I should look here and at the functions being called from there?


#16

Exactly. It might be useful to use SSE when applying the tonecurve, too, but we haven’t done so yet. You also want to look at commit_params() where some setup is done, mostly getting the matrix if possible.


(Carmelo Dr Raw) #17

Here is the first update on the work on PhotoFlow optimization. This time I have included the most recent code for automatic chromatic aberration correction from RT, and enabled its SSE2 optimizations.

Preamble: the improvements I am showing here are by no means the result of my own ideas. Instead, they come from the hard work done by @heckflosse and other RT developers! I have just taken the state-of-the-art RT code and plugged it into PhotoFlow, with a few modifications to adapt it to the PhotoFlow processing pipeline.

Talking about the processing model, a big difference between RT and PF is that RT bases its parallel processing on OpenMP, while PF processes image tiles in parallel using normal threads.
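
Schematically, the two models look like this (a simplified sketch, not the actual code of either program; the per-row and per-tile helpers are just placeholders):

#include <thread>
#include <vector>

// Simplified sketch of the two parallelization models (not the actual RT
// or PF code). The helpers below are placeholders for the real work.
void process_row(int y)  { /* hypothetical per-row processing */ }
void process_tile(int t) { /* hypothetical per-tile processing */ }

// RawTherapee style: one full-image loop, rows split among cores by OpenMP.
void process_image_openmp(int height)
{
    #pragma omp parallel for
    for (int y = 0; y < height; ++y)
        process_row(y);
}

// PhotoFlow style: the image is split into tiles, handled by normal threads.
void process_image_tiles(int n_tiles, int n_threads)
{
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t)
        workers.emplace_back([=]() {
            for (int tile = t; tile < n_tiles; tile += n_threads)
                process_tile(tile);
        });
    for (auto& w : workers)
        w.join();
}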

In the specific case of the CA correction there is also another difference: in PF the analysis phase that derives the CA correction parameters is performed only once, when the image is opened, while in RT it is AFAIK repeated each time the image is processed.

The benchmark is based as usual on an Ubuntu VM with 2 cores and 4GB of RAM, running on an OSX host with 4 cores and 8GB of memory.

Here are the results:

  • amsterdam.pef processed with PhotoFlow, Amaze demosaicing and Jpeg sRGB output:
    no CA correction: 1470 ms
    old CA correction: 1700 ms
    new CA correction, no SSE2: 1690 ms
    new CA correction, with SSE2: 1630 ms (difference with/without CA: 160 ms)

    The improvement is not dramatic, but still measurable and not zero.

  • amsterdam.pef processed with RawTherapee, Amaze demosaicing and Jpeg sRGB output:
    no CA correction: 1490 ms
    with CA correction: 1700 ms

RT and PF are very close here.

Differences become more prominent when processing bigger images like Nikon D810 RAWs:

  • D810 processed with PhotoFlow, Amaze demosaicing and Jpeg sRGB output:
    no CA correction: 4670 ms
    new CA correction, with SSE2: 5190 ms (difference with/without CA: 520 ms)

  • D810 processed with RawTherapee, Amaze demosaicing and Jpeg sRGB output:
    no CA correction: 5000 ms
    new CA correction, with SSE2: 5850 ms (difference with/without CA: 850 ms)

Since the code used in the two programs is basically the same, I assume that the differences come from the fact that RT is repeating the CA analysis during the processing phase…


(Carmelo Dr Raw) #18

A few more optimizations were introduced last week. This time I focused on the intermediate caching of image data, which is particularly important when filters with large input padding are involved (like large blurs or the “split details” module).

Now the code is able to automatically identify the intermediate buffers for which padding is needed, and it introduces in-memory tile caches to avoid re-computing pixels.
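
Schematically, the caching works more or less like this (a simplified sketch, not the real implementation): computed tiles are stored in memory keyed by their coordinates, so when a padded filter asks again for an overlapping region the pixels are looked up instead of being recomputed.

#include <map>
#include <utility>
#include <vector>

// Simplified sketch of an in-memory tile cache (not the actual PhotoFlow
// implementation): tiles are keyed by their top-left coordinates and are
// only computed once, even when a padded filter requests overlapping regions.
struct TileCache {
    std::map<std::pair<int, int>, std::vector<float>> tiles;

    const std::vector<float>& get(int x, int y)
    {
        auto key = std::make_pair(x, y);
        auto it = tiles.find(key);
        if (it != tiles.end())
            return it->second;            // already computed: reuse it
        // Placeholder for the real rendering of the tile's pixels.
        std::vector<float> pixels(64 * 64 * 3, 0.0f);
        return tiles.emplace(key, std::move(pixels)).first->second;
    }
};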

For example, the time required to export the amsterdam.pef image to Jpeg with an additional “split details” layer and 5 levels goes from 14600 ms without caching to 9250 ms with caching enabled.

The mechanism still requires some tweaking and further optimization, but the improvements are already non-negligible.

Updated packages for the stable and linear_gamma branches can be found here and here respectively.


#19

I’m really liking the feel of the images processed in the linear version… there are some annoyances, like layers that swap their places no matter what you do (the same behaviour used to happen with clone layers and paths), not being able to select the blend mode of group layers/bins, LUT loading taking forever (at least for .cube files), etc. But I’m sure those stalactites will be sanded down, and the important thing right now is to say: CONGRATS Andrea and thank YOU :vulcan_salute:

 
Image developed in PhotoFlow unstable linear (amen) and finalised in GIMP with G’MIC and this LUT


(Carmelo Dr Raw) #20

This is now fixed in the latest packages on GitHub…

True, loading of .CUBE LUTs is not fast, but it only happens the first time the LUT is accessed. After that it should be much faster…

Could you provide me with an example of a .PFI file where this happens? I could not reproduce the problem myself.

Thanks!