Optimize RawTherapee performance of tiled processings


(Ingo Weyrich) #1

RawTherapee has some tiled processing steps (Amaze, RCD and X-Trans demosaic, raw CA correction and a few more).
For some of them I added the possibility to tune processing time to the machine you’re running rt on.

All you have to do is enable ‘Measure’ in preferences.

Then start rt from a console and process your images using different ‘tiles per thread’ values for e.g. Amaze demosaic, Raw CA correction and so on. I suggest trying powers of 2 (1, 2, 4, 8, 16) and then narrowing down between the two best values. The best way to measure is to use the queue.
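
The search strategy described here (powers of 2 first, then the values between the two best) could be sketched roughly as follows. This is only an illustration: `measure` is a hypothetical callback standing in for "process the queue with this value and read the time from the console"; it is not a RawTherapee API, and the cost function in the usage lines is a toy, not real timing data.

```python
def find_best_value(measure, coarse=(1, 2, 4, 8, 16)):
    """Coarse-to-fine search for the fastest 'tiles per thread' value.

    measure: hypothetical callable, value -> processing time in ms.
    Returns (best_value, all_timings).
    """
    # Step 1: try the powers of 2.
    timings = {value: measure(value) for value in coarse}

    # Step 2: pick the two fastest values and test everything between them.
    first, second = sorted(timings, key=timings.get)[:2]
    low, high = sorted((first, second))
    for value in range(low + 1, high):
        timings[value] = measure(value)

    # Step 3: the overall fastest value wins.
    best = min(timings, key=timings.get)
    return best, timings


# Toy cost function with its minimum at 6 (NOT real measurements).
best, timings = find_best_value(lambda value: abs(value - 6))
print("best value:", best)
```

In practice each call to `measure` would be one manual run of the queue, so the sketch mainly shows how few runs are needed: five coarse ones plus the handful between the two best.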

Example from my 8-core machine for a 50 MP file,
using value 1 (which is the value used before this optimization) for Amaze, Raw CA correction and rgb processing:

CA correcting 8196x6152 image with 1 tiles per thread
CA correction took 1005 ms
Demosaicing 8196x6152 image using AMaZE with 1 Tiles per Thread
amaze demosaic took 795 ms
rgb processing 8188x6144 image with 1 tiles per thread
rgb processing took 468 ms
CA correcting 8196x6152 image with 1 tiles per thread
CA correction took 1014 ms
Demosaicing 8196x6152 image using AMaZE with 1 Tiles per Thread
amaze demosaic took 841 ms
rgb processing 8188x6144 image with 1 tiles per thread
rgb processing took 476 ms
CA correcting 8196x6152 image with 1 tiles per thread
CA correction took 1022 ms
Demosaicing 8196x6152 image using AMaZE with 1 Tiles per Thread
amaze demosaic took 860 ms
rgb processing 8188x6144 image with 1 tiles per thread
rgb processing took 475 ms

Using value 7 for Amaze and value 6 for Raw CA correction and rgb processing:

CA correcting 8196x6152 image with 6 tiles per thread
CA correction took 916 ms
Demosaicing 8196x6152 image using AMaZE with 7 Tiles per Thread
amaze demosaic took 710 ms
rgb processing 8188x6144 image with 6 tiles per thread
rgb processing took 378 ms
CA correcting 8196x6152 image with 6 tiles per thread
CA correction took 932 ms
Demosaicing 8196x6152 image using AMaZE with 7 Tiles per Thread
amaze demosaic took 710 ms
rgb processing 8188x6144 image with 6 tiles per thread
rgb processing took 389 ms
CA correcting 8196x6152 image with 6 tiles per thread
CA correction took 938 ms
Demosaicing 8196x6152 image using AMaZE with 7 Tiles per Thread
amaze demosaic took 690 ms
rgb processing 8188x6144 image with 6 tiles per thread
rgb processing took 417 ms

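A small but clear improvement: averaged over the three runs, roughly 8 % for CA correction, 15 % for Amaze and 17 % for rgb processing.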
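
To put numbers on the comparison above, the "took ... ms" lines can be averaged per stage. The following is a minimal sketch that parses only those lines, copied from the two console logs in this post; the function and variable names are made up for the illustration.

```python
import re
from collections import defaultdict
from statistics import mean

# "took" lines copied from the two console logs above.
LOG_BEFORE = """\
CA correction took 1005 ms
amaze demosaic took 795 ms
rgb processing took 468 ms
CA correction took 1014 ms
amaze demosaic took 841 ms
rgb processing took 476 ms
CA correction took 1022 ms
amaze demosaic took 860 ms
rgb processing took 475 ms
"""

LOG_AFTER = """\
CA correction took 916 ms
amaze demosaic took 710 ms
rgb processing took 378 ms
CA correction took 932 ms
amaze demosaic took 710 ms
rgb processing took 389 ms
CA correction took 938 ms
amaze demosaic took 690 ms
rgb processing took 417 ms
"""

TOOK = re.compile(r"^(?P<stage>.+?) took (?P<ms>\d+) ms$")


def mean_times(log):
    """Return {stage name: mean time in ms} for all 'took' lines in log."""
    samples = defaultdict(list)
    for line in log.splitlines():
        match = TOOK.match(line.strip())
        if match:
            samples[match["stage"]].append(int(match["ms"]))
    return {stage: mean(values) for stage, values in samples.items()}


def improvement_percent(before, after):
    """Return {stage name: time saved in percent of the 'before' time}."""
    return {s: 100.0 * (before[s] - after[s]) / before[s] for s in before}


before = mean_times(LOG_BEFORE)
after = mean_times(LOG_AFTER)
for stage, gain in improvement_percent(before, after).items():
    print(f"{stage}: {before[stage]:.0f} ms -> {after[stage]:.0f} ms ({gain:.1f} % faster)")
```

Three runs per setting is a small sample, so the percentages are indicative rather than exact.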


#2

Could this be automatic? Or does it depend on the input images too much?


(Ingo Weyrich) #3

In my tests it mostly depends on the machine (CPU and so on) you run rt on. The dependency on input images is very low, though there is a difference between, for example, demosaicing 100 MP files and 1920x1080 video raw files. For the latter, a smaller value than for the 100 MP files gives better performance (though only marginally).

I have already thought about automating this, but not for rt 5.6.


#4

Great, so one would only need to conduct the measurements once. Not that I need to, because I don’t process enough files for it to matter.


(Ingo Weyrich) #5

Some background:

What is tiled processing?

Tiled processing handles an image in tiles of a certain size. Assume you have an image of size 1920x1280 pixels and the size of the tiles is 128x128 pixels.

That would look like this (each T represents a tile):

T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
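
The grid above follows from simple arithmetic: 1920 / 128 = 15 columns and 1280 / 128 = 10 rows, so 150 tiles. A small sketch of that calculation, assuming plain non-overlapping tiles (real tiled algorithms may use overlapping borders, which this ignores); `tile_grid` is a name made up for the illustration:

```python
import math


def tile_grid(width, height, tile_size):
    """Return (columns, rows) of square tiles needed to cover the image."""
    cols = math.ceil(width / tile_size)   # partial tiles at the edge count too
    rows = math.ceil(height / tile_size)
    return cols, rows


cols, rows = tile_grid(1920, 1280, 128)
print(cols, rows, cols * rows)  # 15 10 150

# Draw the same picture as above: one "T" per tile.
for _ in range(rows):
    print(" ".join("T" * cols))
```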

Tiles per thread = 1 on an 8-core machine could now lead to this scheduling (each number is the core that processes the tile):

1 2 3 4 5 6 7 8 2 1 4 3 5 6 7
8 2 1 4 3 T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T

while tiles per thread = 4 could result in this scheduling:

1 1 1 1 2 2 2 2 3 3 3 3 4 4 4
4 5 5 5 5 6 6 6 6 7 7 7 7 8 8
8 8 2 2 2 2 4 4 4 4 3 3 3 3 T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
T T T T T T T T T T T T T T T
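
The two schedules can be reproduced with a toy simulation. The sketch below assumes a simple model in which whichever core becomes free first grabs the next run of `tiles_per_thread` consecutive tiles, and every tile costs the same; it is not RawTherapee's actual scheduling code, and the function name is invented for the illustration. On a real machine the order in which cores grab work varies from run to run, as in the diagrams above.

```python
import heapq


def simulate_schedule(num_tiles, num_cores, tiles_per_thread, tile_cost=1.0):
    """Assign tiles to cores in chunks of `tiles_per_thread`.

    Returns (owner, grabs): owner[i] is the core (1-based) that processes
    tile i, grabs is how often a core had to fetch a new chunk of work.
    """
    # Heap of (time at which the core becomes free, core number).
    free_at = [(0.0, core) for core in range(1, num_cores + 1)]
    heapq.heapify(free_at)

    owner = []
    grabs = 0
    while len(owner) < num_tiles:
        time_free, core = heapq.heappop(free_at)  # earliest free core
        chunk = min(tiles_per_thread, num_tiles - len(owner))
        owner.extend([core] * chunk)
        grabs += 1
        heapq.heappush(free_at, (time_free + chunk * tile_cost, core))
    return owner, grabs


for tiles_per_thread in (1, 4):
    owner, grabs = simulate_schedule(150, 8, tiles_per_thread)
    first_row = " ".join(str(core) for core in owner[:15])
    print(f"tiles per thread = {tiles_per_thread}: {grabs} grabs, first row: {first_row}")
```

In this model the 150 tiles need 150 grabs with value 1 but only 38 with value 4, which shows how a larger value reduces how often threads have to fetch new work.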

(Ingo Weyrich) #6

Exactly


(Stefan Chirila) #7

So what number of tiles would give the best visual result :stuck_out_tongue:? Or is this whole thing only about how fast things are done?


(Ingo Weyrich) #8

Only for speed. Visual result is the same.


(Stefan Chirila) #9

@heckflosse oh okay. It’s cool I guess that there are options to make it faster and more accessible for less powerful machines. I spoiled myself to a lovely 8 core 2 years ago and have nothing to complain about :slight_smile:

So happy about where RT is these days. So awesome.


(Roel) #10

I guess that with a large enough dataset we could find a common denominator in the optimal number of tiles.