OpenCL tuning in Darktable...

Has anyone experimented much with this yet?

I am trying to evaluate it.

I have three batch files. The first two run darktable-cli with the settings from the main configuration file, one with OpenCL enabled and one with it disabled. The third adds --core --configdir "C:\altconfig"; I run the darktable GUI with that same configdir and make my changes in the processing tab of the preferences, or directly in the darktablerc in that folder.

The idea is to have standard benchmarks that I leave at the defaults (or a desired setting), plus one where I tweak settings for comparison (ALTCONFIG).
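
In case it helps, the three batch files boil down to command lines roughly like these (the file names are placeholders, and -d perf is optional, it just makes darktable log per-module timings):

```
darktable-cli image.raw image.raw.xmp out_default.jpg --core -d perf
darktable-cli image.raw image.raw.xmp out_noopencl.jpg --core --disable-opencl -d perf
darktable-cli image.raw image.raw.xmp out_alt.jpg --core --configdir "C:\altconfig" -d perf
```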

At the moment I have found that, for me, setting the tuning to none or transfer is the fastest. Memory, or memory and transfer, gives slower results on the three files I tested.

There is probably a post, or maybe one of the PRs around this, with more info, but I was wondering: when I set it to none, what values does it actually use? I'll go digging, but I thought I would ask.

I think this is working; however, with the new parameters plus some old ones still sitting in the darktablerc config files (headroom, micronap, etc.) it is hard (for me) to decipher what is and is not being used…

For example, I know setting micronap to 0 was not recommended, but I found it stable and faster on my system, and now I can't seem to set it to that even when I enter it. The number of event handles also seems fixed. opencl_mandatory_timeout=2000 will change if I edit it, but it seems to be the only one… I now see some grouped settings that likely reflect the resources setting; I have not tried to alter any of those.
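
For reference, these are the kind of old entries I mean, roughly as they sit in my carried-over darktablerc (key names as best I can read them from my file; the numbers are only examples, not recommendations):

```
opencl_micro_nap=250
opencl_mandatory_timeout=2000
opencl_number_event_handles=25
opencl_use_pinned_memory=false
opencl_memory_headroom=300
```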

I'll play a bit more and wait for some documentation to surface if I don't stumble onto anything, but if anyone has comments, suggestions or experience so far experimenting with this, I wouldn't mind hearing about it…

Edit:

I see that if I start with an empty directory and get a brand-new config file, rather than an updated one, most of the old parameters are gone. So I guess that in an upgraded file they remain as entries but are no longer used? Or so it would seem…

Exactly :slight_smile: DT keeps old conf data in the file.

Hi, I don't think I can help answer your questions, but I will share some observations I made.
My system: iMac, 3.8 GHz 8-core Intel Core i7, 64 GB memory, AMD Radeon Pro 5700 XT 16 GB, macOS 11.5.1.
I have a .nef file with a fixed development that I use for testing; it has two instances of diffuse or sharpen, the first with 16 iterations and the second with 6.
I use the current build from @MStraeten and run darktable -d perf to check the timing when exporting a .jpg file.
The fastest result (20 sec for the export) comes with resources set to small, very fast GPU, and tune memory size and transfer.
Any other resources setting results in export times around 22 sec.
Looking at the timing, I found that the difference comes from the second instance of diffuse or sharpen:

resources small:
431,989984 [dev_pixelpipe] took 12,733 secs (6,436 CPU) processed `diffuse or sharpen’ on GPU with tiling, blended on CPU [export]

resources default (similar with large):
492,464406 [dev_pixelpipe] took 14,032 secs (1,048 CPU) processed `diffuse or sharpen’ on GPU, blended on GPU [export]
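
(If anyone wants to see the same lines on their machine: I just start darktable from a terminal with -d perf and do the export; the grep below is only to cut the output down to the pixelpipe lines, assuming the debug messages go to the terminal as they do for me.)

```
darktable -d perf 2>&1 | grep dev_pixelpipe
```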

Hope all this is not too confusing. I am not at all a computer guy, but I was expecting, maybe naively, that with small resources things would go slower, and that tiling would slow down the process, but the results point the other way.
As I said, this doesn't answer your questions, and it probably adds some new ones. If you have any suggestions I will be glad to test!

Thanks, I do something similar. I use a Sony file, a Google Pixel raw with multiple diffuse or sharpen instances, some denoise and a few other modules, and the common file/xmp pair Bench.SRW that I have seen others use for running OpenCL benchmarks. Interesting that it is fastest with small… I did not try that option on my system; I tested all the options using large as the setting…

I have put a comment under your most recent commit… I was back doing some testing tonight and noticed that if I set the optimization to nothing in preferences and then manually edited darktablerc, all parameters in that set (pinned memory, handles, etc.) could be changed and the values would be honored the next time I used the program, but running either darktable-cli or the GUI would reset any micronap value to 250.

I was curious about this, so I did a few tests with a file (26 MP) with a couple of diffuse or sharpen modules and other edits. This is a week-old build of darktable.

I run a Ryzen 3700X, 16 GB of DDR4-3000 CL15 plus zram for a total of 32 GB, and an RTX 3080 (10 GB version). My OS is Arch Linux with the zen kernel.

I ran each of these a few times to make sure the results were consistent:

Unrestricted: [dev_process_export] pixel pipeline processing took 5.309 secs (10.217 CPU)

Full run (unrestricted):

```
476.679307 [dev] took 0.000 secs (0.000 CPU) to load the image.
476.766566 [export] creating pixelpipe took 0.071 secs (0.299 CPU)
476.766613 [dev_pixelpipe] took 0.000 secs (0.000 CPU) initing base buffer [export]
476.777980 [dev_pixelpipe] took 0.011 secs (0.009 CPU) processed 'raw black/white point' on GPU, blended on GPU [export]
476.784519 [dev_pixelpipe] took 0.007 secs (0.003 CPU) processed 'white balance' on GPU, blended on GPU [export]
476.789879 [dev_pixelpipe] took 0.005 secs (0.000 CPU) processed 'highlight reconstruction' on GPU, blended on GPU [export]
477.120035 [dev_pixelpipe] took 0.330 secs (0.194 CPU) processed 'demosaic' on GPU, blended on GPU [export]
477.275090 [dev_pixelpipe] took 0.155 secs (0.063 CPU) processed 'denoise (profiled)' on GPU, blended on GPU [export]
477.286887 [dev_pixelpipe] took 0.012 secs (0.002 CPU) processed 'lens correction' on GPU, blended on GPU [export]
478.009802 [dev_pixelpipe] took 0.723 secs (6.642 CPU) processed 'chromatic aberrations' on CPU, blended on CPU [export]
478.066841 [dev_pixelpipe] took 0.057 secs (0.051 CPU) processed 'exposure' on GPU, blended on GPU [export]
478.085059 [dev_pixelpipe] took 0.018 secs (0.006 CPU) processed 'input color profile' on GPU, blended on GPU [export]
image colorspace transform Lab-->RGB took 0.005 secs (0.001 GPU) [channelmixerrgb ]
478.109439 [dev_pixelpipe] took 0.024 secs (0.005 CPU) processed 'color calibration' on GPU, blended on GPU [export]
479.649611 [dev_pixelpipe] took 1.540 secs (1.430 CPU) processed 'diffuse or sharpen' on GPU, blended on GPU [export]
481.106347 [dev_pixelpipe] took 1.457 secs (1.233 CPU) processed 'diffuse or sharpen 1' on GPU, blended on GPU [export]
481.446900 [dev_pixelpipe] took 0.341 secs (0.072 CPU) processed 'diffuse or sharpen 2' on GPU, blended on GPU [export]
481.620135 [dev_pixelpipe] took 0.173 secs (0.037 CPU) processed 'diffuse or sharpen 3' on GPU, blended on GPU [export]
481.635950 [dev_pixelpipe] took 0.016 secs (0.004 CPU) processed 'color balance rgb' on GPU, blended on GPU [export]
481.772579 [dev_pixelpipe] took 0.137 secs (0.020 CPU) processed 'filmic rgb' on GPU, blended on GPU [export]
image colorspace transform RGB-->Lab took 0.005 secs (0.000 GPU) [bilat ]
481.989914 [dev_pixelpipe] took 0.217 secs (0.023 CPU) processed 'local contrast' on GPU, blended on GPU [export]
482.007919 [dev_pixelpipe] took 0.018 secs (0.006 CPU) processed 'output color profile' on GPU, blended on GPU [export]
482.076001 [dev_pixelpipe] took 0.068 secs (0.417 CPU) processed 'display encoding' on CPU, blended on CPU [export]
482.076039 [dev_process_export] pixel pipeline processing took 5.309 secs (10.217 CPU)
```

Large: 634.279522 [dev_process_export] pixel pipeline processing took 5.390 secs (10.458 CPU)

Small: [dev_process_export] pixel pipeline processing took 10.178 secs (46.917 CPU)
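
If someone wants to script the comparison instead of flipping the preference by hand, something along these lines should work with darktable-cli (just a sketch: the file names are placeholders, and I haven't verified that resourcelevel is the exact key the resources preference is stored under, or that these are the stored level names):

```
# assumption: "resourcelevel" and the level names below match your darktablerc
for level in small default large unrestricted; do
  darktable-cli image.raw image.raw.xmp "out_${level}.jpg" \
    --core --conf "resourcelevel=${level}" -d perf
done
```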

Interestingly, I saw a few warnings in the output:

[pixelpipe_process_on_CPU] Warning: processes 'demosaic' even if memory requirements are not met

[pixelpipe_process_on_CPU] Warning: processes 'bilat' even if memory requirements are not met

Why does darktable print the module names like `module' instead of 'module'? It makes them quite a pain to paste into Markdown-compatible text files or forum posts.


On my machine it seems that the 'memory transfer' optimization was the fastest, but only barely compared to 'memory size and transfer'.

Yeah. Current master has fixed your issue, thanks for the exact report btw :slight_smile:

Hey, kudos to you for all the hard work you are putting into what is basically a subproject to clean up and optimize the OpenCL code of dt… much appreciated.


You made a very interesting observation. What you observed is most likely the advantage of processing data in GPU cache outweighing the tiling overhead. This is exactly one of the points I am interested in fixing in dt post 4.0.

But in general you will be much better off if you can avoid tiling.

Just built dt now… it seems to behave the same way; the only micronap setting I can get is 250…??


Even when I set it to something different and make darktablerc read-only… the value of micronap now stays as set, say 50 or 500, and yet darktable-cli uses 250; if I remove the read-only attribute, the setting in darktablerc gets changed back to 250… so is 250 hard-coded in somewhere??

My commit handling this was not merged - wait for Fix OpenCL micro_nap resetting by jenshannoschwalm · Pull Request #11602 · darktable-org/darktable · GitHub


Sorry, thanks, I thought it had gone in…

I thought so too :slight_smile: Pascal merged the PR before I updated it… my bad.


Great, thanks for explaining the possible reason for the observed behaviour, and also thank you for all the work on darktable. Looking forward to post-4.0 while enjoying 3.9.