Optimize RawTherapee performance of tiled processings

RawTherapee has some tiled processing steps (AMaZE, RCD and X-Trans demosaicing, raw CA correction and some more).
For some of them I added the possibility to optimize processing time according to the machine you’re running rt on.

All you have to do is to enable ‘Measure’ in preferences:
[Screenshot: the ‘Measure’ option in Preferences]

Then start rt from the console and process your images using different tiles-per-thread values for e.g. Amaze demosaic, Raw CA correction and so on. I suggest trying powers of 2 (1, 2, 4, 8, 16) and then narrowing down between the two best values. The best way to measure is to use the queue.
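The search described above (powers of 2 first, then narrowing down between the two best values) could be sketched like this in Python. Note that `measure` is a hypothetical callback and not part of RawTherapee: it stands for "set the value in preferences, process the queue, read the time from the console". The `fake_measure` curve is made up and only there so the sketch runs.

```python
def find_best_value(measure, coarse=(1, 2, 4, 8, 16)):
    """Try the coarse candidates, then every value between the two best ones.

    measure(value) is a hypothetical callback returning a time in ms.
    """
    timings = {v: measure(v) for v in coarse}
    # The two fastest coarse candidates bracket the range to refine.
    best_two = sorted(timings, key=timings.get)[:2]
    lo, hi = min(best_two), max(best_two)
    for v in range(lo + 1, hi):
        timings[v] = measure(v)
    best = min(timings, key=timings.get)
    return best, timings


def fake_measure(value):
    # Made-up timing curve with its minimum at 6, for demonstration only.
    return 700 + 10 * abs(value - 6)


if __name__ == "__main__":
    best, timings = find_best_value(fake_measure)
    print("best value:", best)
    for v in sorted(timings):
        print(v, timings[v], "ms")
```

In practice each measurement should be repeated a few times, since single runs vary by some milliseconds.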

Example from my 8-core machine for a 50 MP file:
Using value 1 (which is the value used before this optimization) for Amaze, Raw CA correction and rgb processing:

CA correcting 8196x6152 image with 1 tiles per thread
CA correction took 1005 ms
Demosaicing 8196x6152 image using AMaZE with 1 Tiles per Thread
amaze demosaic took 795 ms
rgb processing 8188x6144 image with 1 tiles per thread
rgb processing took 468 ms
CA correcting 8196x6152 image with 1 tiles per thread
CA correction took 1014 ms
Demosaicing 8196x6152 image using AMaZE with 1 Tiles per Thread
amaze demosaic took 841 ms
rgb processing 8188x6144 image with 1 tiles per thread
rgb processing took 476 ms
CA correcting 8196x6152 image with 1 tiles per thread
CA correction took 1022 ms
Demosaicing 8196x6152 image using AMaZE with 1 Tiles per Thread
amaze demosaic took 860 ms
rgb processing 8188x6144 image with 1 tiles per thread
rgb processing took 475 ms

Using value 7 for Amaze and value 6 for Raw CA correction and rgb processing:

CA correcting 8196x6152 image with 6 tiles per thread
CA correction took 916 ms
Demosaicing 8196x6152 image using AMaZE with 7 Tiles per Thread
amaze demosaic took 710 ms
rgb processing 8188x6144 image with 6 tiles per thread
rgb processing took 378 ms
CA correcting 8196x6152 image with 6 tiles per thread
CA correction took 932 ms
Demosaicing 8196x6152 image using AMaZE with 7 Tiles per Thread
amaze demosaic took 710 ms
rgb processing 8188x6144 image with 6 tiles per thread
rgb processing took 389 ms
CA correcting 8196x6152 image with 6 tiles per thread
CA correction took 938 ms
Demosaicing 8196x6152 image using AMaZE with 7 Tiles per Thread
amaze demosaic took 690 ms
rgb processing 8188x6144 image with 6 tiles per thread
rgb processing took 417 ms

A small, but clear improvement.
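To compare runs without reading the numbers by eye, the console output can be averaged per step and per tiles-per-thread value. The following Python sketch assumes the log lines look exactly like the ones quoted above; other steps or other RT versions may print different text, so treat the patterns as an assumption.

```python
import re
from collections import defaultdict
from statistics import mean

# Patterns follow the console lines quoted above (assumed format).
START = re.compile(r"\d+x\d+ image .*with (?P<tiles>\d+) tiles per thread",
                   re.IGNORECASE)
TOOK = re.compile(r"^(?P<stage>.+?) took (?P<ms>\d+) ms$")


def average_timings(log_text):
    """Return {(stage, tiles_per_thread): mean time in ms}."""
    samples = defaultdict(list)
    tiles = None
    for line in log_text.splitlines():
        line = line.strip()
        start = START.search(line)
        if start:
            # Remember the tiles-per-thread value for the next "took" line.
            tiles = int(start.group("tiles"))
            continue
        took = TOOK.match(line)
        if took and tiles is not None:
            samples[(took.group("stage"), tiles)].append(int(took.group("ms")))
            tiles = None
    return {key: mean(values) for key, values in samples.items()}


SAMPLE = """\
CA correcting 8196x6152 image with 1 tiles per thread
CA correction took 1005 ms
Demosaicing 8196x6152 image using AMaZE with 1 Tiles per Thread
amaze demosaic took 795 ms
CA correcting 8196x6152 image with 6 tiles per thread
CA correction took 916 ms
Demosaicing 8196x6152 image using AMaZE with 7 Tiles per Thread
amaze demosaic took 710 ms
"""

if __name__ == "__main__":
    for (stage, tiles), ms in sorted(average_timings(SAMPLE).items()):
        print(f"{stage} ({tiles} tiles per thread): {ms:.0f} ms")
```

Averaging the three runs of each log above by hand gives roughly 1014 ms vs 929 ms for CA correction, 832 ms vs 703 ms for Amaze and 473 ms vs 395 ms for rgb processing.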

Could this be automatic? Or does it depend on the input images too much?

In my tests it mostly depends on the machine (cpu and so on) you run rt on. The dependency on input images is very low, though there is a difference between, for example, demosaicing 100 MP files and 1920x1080 video raw files. For the latter, a smaller value than for the 100 MP files gives better performance (though only marginally).

I already thought about automating this, but not for rt 5.6

Great, so one would only need to conduct the measurements once. Not that I need to, because I don’t process enough files for it to matter.

Some background:

What is a tiled processing?

A tiled processing processes an image in tiles of a certain size. Assume you have an image of size 1920x1280 pixels and the size of the tiles is 128x128 pixels.

That would look like this (each T represents a tile):

T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
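The grid above follows directly from the numbers in the example. A minimal Python sketch of the arithmetic (it ignores any overlap or border between tiles that a real implementation may use):

```python
def tile_grid(width, height, tile_size):
    """Number of tile columns and rows needed to cover the image."""
    cols = -(-width // tile_size)   # integer ceiling division
    rows = -(-height // tile_size)
    return cols, rows


if __name__ == "__main__":
    cols, rows = tile_grid(1920, 1280, 128)
    print(cols, "columns x", rows, "rows =", cols * rows, "tiles")
```

For the 1920x1280 example this gives 15 columns and 10 rows, matching the diagram.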

Tiles per thread = 1 on an 8-core machine could now lead to this scheduling (each number is the number of the core processing that tile):

1 2 3 4 5 6 7 8 2 1 4 3 5 6 7
8 2 1 4 3 T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T

while tiles per thread = 4 could result in this scheduling

1 1 1 1 2 2 2 2 3 3 3 3 4 4 4
4 5 5 5 5 6 6 6 6 7 7 7 7 8 8
8 8 1 1 1 1 2 2 2 2 3 3 3 3 4
4 4 4 5 5 5 5 6 6 6 6 T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
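The two diagrams can be reproduced with a small Python simulation. This is a simplified sketch, not RT's actual scheduler: it assumes every chunk takes the same time, so chunks are handed out to the cores strictly in turn. In reality the order varies, as the "2 1 4 3" in the first diagram shows.

```python
def assign_tiles(n_tiles, n_threads, tiles_per_thread):
    """Assign tiles to threads in chunks, handing chunks out in turn.

    Simplification: every chunk takes the same time, so threads always
    become free in the same order. Real scheduling is less regular.
    """
    owners = []
    for tile in range(n_tiles):
        chunk = tile // tiles_per_thread
        owners.append(chunk % n_threads + 1)  # 1-based thread number
    return owners


def format_grid(owners, cols):
    """Lay the flat owner list out as rows of the tile grid."""
    rows = []
    for start in range(0, len(owners), cols):
        rows.append(" ".join(str(o) for o in owners[start:start + cols]))
    return "\n".join(rows)


if __name__ == "__main__":
    # 15 x 10 tiles as in the example, 8 cores, 4 tiles per thread.
    print(format_grid(assign_tiles(150, 8, 4), 15))
```

With 4 tiles per thread the first rows come out the same as in the diagram above: each core works on 4 neighbouring tiles before it fetches new work.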

Exactly

So what number of tiles would give the best visual result :stuck_out_tongue:? Or is this whole thing only about how fast things are done?

Only for speed. Visual result is the same.

@heckflosse oh okay. It’s cool I guess that there are options to make it faster and more accessible for less powerful machines. I spoiled myself to a lovely 8 core 2 years ago and have nothing to complain about :slight_smile:

So happy about where RT is these days. So awesome.

I guess that with a large enough dataset we could find a common denominator in the optimal number of tiles.

I just figured out how to start the RT GUI from the console and take processing measurements. Do the optimal thread settings tend to hold consistent when batch processing with the command line version of RawTherapee? What about different image resolutions?

On a slightly related topic, last time I did video processing with RawTherapee via the command line, I noticed that my CPU utilization hovered around 20%. Does that mean that I’d gain performance by running multiple instances of rawtherapee-cli? How would this factor in with which thread settings are optimal?

Also, I see only 3 demosaicing algorithms, AMaZE, RCD, and Xtrans, that have tiles/thread optimization settings. Are these the most computationally intensive demosaicing algorithms?

Yes, the optimal settings hold for the command line version as well.

Almost all algorithms in RT scale well with the number of cores. The biggest exceptions are the decoders. For example, decoding Nikon NEF uses only 1 core, while decoding Sony ARW uses all cores. Writing the output file also uses only 1 core and can take a while, especially when writing compressed TIFF.

The other demosaicing algorithms are not tiled.

No, the slowest of the Bayer demosaicers I personally use are VNG4 and LMMSE.
