OpenCL analysis... darktable... much faster with OpenCL disabled... something wrong??

So, right up front: I don’t have a good technical understanding of OpenCL and all the possible parameters that could be tweaked. I have tried tuning based on the section of the manual that covers performance, and I don’t really see any measurable improvement…

I came across an older thread and copied the raw file and the xmp used for testing.

And I used suggested command lines with darktable-cli to test with and without OpenCL.
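
For anyone wanting to repeat the comparison, here is a minimal sketch (not from the original posts) that assembles the two command lines. The flags are the ones used in the runs posted further down this thread; the helper name `build_bench_cmd` and the assumption that `darktable-cli` is on the PATH are mine. It only builds the argument lists and does not run anything.

```python
import shlex


def build_bench_cmd(raw, xmp, out, disable_opencl=False, cli="darktable-cli"):
    """Return the argument list for one benchmark export.

    `cli` defaults to a darktable-cli found on the PATH (an assumption);
    pass a full path if your build lives elsewhere.
    """
    cmd = [
        cli, raw, xmp, out,
        "--width", "0", "--height", "0", "--hq", "1",
        "--core", "-d", "perf",
    ]
    if disable_opencl:
        # Same command, but forcing the CPU-only code path.
        cmd.append("--disable-opencl")
    return cmd


if __name__ == "__main__":
    for cpu_only in (False, True):
        argv = build_bench_cmd("bench.SRW", "bench.SRW.xmp", "output.pfm", cpu_only)
        print(shlex.join(argv))
```

Each list can be handed to `subprocess.run(cmd, capture_output=True, text=True)` to capture the `-d perf` output of a run. Note that the runs posted later in the thread also clear the mipmap cache before each export.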

Basically I found that the CPU-only run was almost 15 seconds faster than the run with OpenCL enabled.

For this particular pipeline the main culprit seems to be astro denoise, as it takes 15 seconds with OpenCL enabled… and only around 4 seconds with it disabled…

My current graphics card is old and crappy… a 4GB GTX 745…

Below are the two summaries, with the command line in the txt file, plus the xmp and raw file used to run it… I am just wondering if anyone who knows about OpenCL parameters can see anything that might be causing this… or has suggestions for tweaks that might be useful…

bench.SRW.zip (19.4 MB)

Morning, Peter,

I found an interesting clue in your output:

With OpenCL enabled the GPU assists in processing:

15.472 secs processed `astrophoto denoise' on GPU, blended on GPU

But with OpenCL disabled the GPU is not allowed to assist in processing, so all processing is performed by your faster CPU:

4.911 secs processed `astrophoto denoise' on CPU, blended on CPU

At least that is what I believe…

Have fun!
Claes in Lund, Sweden

That is interesting. My previous GPU was a crappy old Nvidia GT 1030 with a paltry 2GB of RAM. It showed a significant improvement in processing times with OpenCL compared to without.

I’ve found an old benchmark I ran (using the Phoronix darktable benchmark). My old 2GB GT 1030 showed quite a speed-up over the CPU.

@Claes @Brian_Innes I did note the astro module as the big culprit in this xmp…I will try a few more runs on some different pipes and see how it shakes out…

Thx guys

On which OS are you running dt? On Windows, dt seems to be faster without opencl, with a rather new Nvidia card, at least this was the case some months ago.

Tried this on my system (Win10, i5, 3.2 GHz, 16GB Ram, SSD)
GPU: GeForce GTX 1060 6GB

with GPU:
8,702281 [dev_pixelpipe] took 3,868 secs (0,500 CPU) processed `astrophoto denoise' on GPU, blended on GPU [export]
10,071328 [dev_process_export] pixel pipeline processing took 8,207 secs (8,469 CPU)

without GPU:
11,373314 [dev_pixelpipe] took 5,675 secs (21,625 CPU) processed `astrophoto denoise' on CPU, blended on CPU [export]
18,431234 [dev_process_export] pixel pipeline processing took 16,563 secs (60,172 CPU)

Normally, OpenCL gives approx. a 50% speedup on my system.
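
As a worked check of the two totals above, here is a small sketch (my own, not part of the original post) that reads the comma-decimal numbers this Windows locale produces and computes the ratio. The regex and the name `total_secs` are illustrative assumptions; only the two log lines are copied from the post.

```python
import re

# The two summary lines quoted above (comma used as decimal separator).
WITH_GPU = "10,071328 [dev_process_export] pixel pipeline processing took 8,207 secs (8,469 CPU)"
WITHOUT_GPU = "18,431234 [dev_process_export] pixel pipeline processing took 16,563 secs (60,172 CPU)"

TOTAL_RE = re.compile(r"pixel pipeline processing took ([\d.,]+) secs")


def total_secs(line):
    """Extract the wall-clock total from a [dev_process_export] line."""
    match = TOTAL_RE.search(line)
    if match is None:
        raise ValueError("not a pipeline summary line")
    # Normalise a decimal comma to a decimal point before converting.
    return float(match.group(1).replace(",", "."))


gpu = total_secs(WITH_GPU)
cpu = total_secs(WITHOUT_GPU)
print(f"CPU-only / OpenCL wall time: {cpu / gpu:.2f}x")
print(f"time saved with OpenCL: {100 * (1 - gpu / cpu):.0f}%")
```

For this export the OpenCL run takes about half the wall time of the CPU-only run, which matches the "approx. 50%" figure.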

Thx, I recall your trials and tribulations when you upgraded… hopefully you got it sorted as best as can be expected…

Yes I am on Windows…

I did just update to the latest Nvidia drivers, not sure if that is the issue… it’s an old crappy card, but until prices make sense I am stuck with it…

I will need to run a few more different module groups and see if it is perhaps only one or two modules that are the culprit with my setup…

Thanks for your feedback

Well, I tested dt on Windows, but I don’t actually use it on Windows; on Linux it runs just fine.

I just want to remind the devs, in case they read this, that after I noticed that dt is much slower on Windows on my desktop, I also tested it on Windows on my laptop and the result was the same.
Something is very broken in the Windows build.

I’ll do a bit more testing… thx… likely my crappy card is just slow… although my CPU is also older, so I would have thought the GPU would still be faster… that is why I was concerned it was something in my OpenCL settings… I wonder if any one parameter is more impactful on performance than another, or if it’s a matter of hitting upon the right combination…

Performance testing and comparison is a hard thing, I think. It depends on so many things:

  • driver
  • CPU vs GPU (and whether one is more powerful or more recent than the other). For example, Intel OpenCL has improved recently but is still not the best. Some Nvidia driver releases are not good.
  • code optimizations (darktable has had a lot of optimizations in recent months), probably some more effective in the CPU code and others in the GPU code
  • On Linux, for example, I have also seen some huge differences between two kernel releases. I remember a kernel release that slowed performance down, and the next release brought a huge boost

The list is not exhaustive!

Thanks for the feedback…I need to poke around a bit more…

I’ll second that observation. I have a dual-boot laptop and used the Windows version of dt last week. Way slower than the Linux version on the same machine.

It also depends on the pipeline.

For example, for the pipeline dt uses to determine whether OpenCL is faster than the CPU, OCL on an i5-7200U is declared “slower” than CPU. But on most of the real-world pipelines I used back when I still used dt, OCL was actually much faster.

All it does is run a single gaussian blur to test if CPU or GPU is faster. On my system (i7-5820K, GTX 1080) the CPU is faster for that test, but the GPU is definitely faster in general for real pipelines.

I second this, we should throw in how it was compiled too. Some time ago I started compiling from source myself in lieu of using the packages from the SuSE OBS as I found they used some rather unoptimized defaults.

Even just following the bare bones build process documented online for the release it’s better.

It’s compiled to work on as many processors as possible, so it eschews CPU-specific optimizations where needed.

If your need is for speed, then compile it yourself!

And if you compile it yourself, make sure to create a ‘release build’:
use --build-type Release instead of RelWithDebInfo.

This can have a huge performance impact (I’m not talking about 20%, but rather doubling the speed):

Hardware: Ryzen 5 5600X, release build, NVidia 1060/6GB

With OpenCL:

kofa@eagle:/tmp$ rm -rf ~/.cache/darktable/mipmaps-6ac0672dac6fe81d5e75505cc7fa15bfeed1acf8.d/ ; ~/darktable-master/bin/darktable-cli bench.SRW bench.SRW.xmp output.pfm --width 0 --height 0 --hq 1 --core -d perf
0.588920 [dev] took 0.101 secs (0.091 CPU) to load the image.
[_dev_auto_apply_presets] missing mandatory module rawprepare for image 1
[_dev_auto_apply_presets] missing mandatory module colorin for image 1
[_dev_auto_apply_presets] missing mandatory module colorout for image 1
[_dev_auto_apply_presets] missing mandatory module gamma for image 1
0.631202 [export] creating pixelpipe took 0.038 secs (0.372 CPU)
0.635601 [dev_pixelpipe] took 0.004 secs (0.015 CPU) initing base buffer [export]
0.642216 [dev_pixelpipe] took 0.007 secs (0.062 CPU) processed `raw black/white point' on GPU, blended on GPU [export]
0.644776 [dev_pixelpipe] took 0.003 secs (0.000 CPU) processed `white balance' on GPU, blended on GPU [export]
0.708916 [dev_pixelpipe] took 0.064 secs (0.468 CPU) processed `highlight reconstruction' on CPU, blended on CPU [export]
0.797900 [dev_pixelpipe] took 0.089 secs (0.986 CPU) processed `demosaic' on CPU, blended on CPU [export]
1.063531 [dev_pixelpipe] took 0.266 secs (2.177 CPU) processed `tone mapping' on CPU, blended on CPU [export]
1.218839 [dev_pixelpipe] took 0.155 secs (0.471 CPU) processed `lens correction' on GPU, blended on GPU [export]
1.233897 [dev_pixelpipe] took 0.015 secs (0.007 CPU) processed `base curve' on GPU, blended on GPU [export]
1.249046 [dev_pixelpipe] took 0.015 secs (0.007 CPU) processed `input color profile' on GPU, blended on GPU [export]
1.265584 [dev_pixelpipe] took 0.017 secs (0.012 CPU) processed `color reconstruction' on GPU, blended on GPU [export]
3.329972 [dev_pixelpipe] took 2.064 secs (2.043 CPU) processed `astrophoto denoise' on GPU, blended on GPU [export]
3.372937 [dev_pixelpipe] took 0.043 secs (0.031 CPU) processed `global tonemap' on GPU, blended on GPU [export]
3.448512 [dev_pixelpipe] took 0.076 secs (0.056 CPU) processed `shadows and highlights' on GPU, blended on GPU [export]
3.788413 [dev_pixelpipe] took 0.340 secs (0.296 CPU) processed `contrast equalizer' on GPU, blended on GPU [export]
3.812484 [dev_pixelpipe] took 0.024 secs (0.016 CPU) processed `local contrast' on GPU, blended on GPU [export]
3.838831 [dev_pixelpipe] took 0.026 secs (0.015 CPU) processed `color zones' on GPU, blended on GPU [export]
3.851727 [dev_pixelpipe] took 0.013 secs (0.009 CPU) processed `levels' on GPU, blended on GPU [export]
3.879112 [dev_pixelpipe] took 0.027 secs (0.015 CPU) processed `sharpen' on GPU, blended on GPU [export]
3.891835 [dev_pixelpipe] took 0.013 secs (0.009 CPU) processed `color contrast' on GPU, blended on GPU [export]
4.252303 [dev_pixelpipe] took 0.360 secs (3.890 CPU) processed `output color profile' on CPU, blended on CPU [export]
4.252455 [dev_process_export] pixel pipeline processing took 3.621 secs (10.586 CPU)
[export_job] exported to `output.pfm'

Without OpenCL:

kofa@eagle:/tmp$ rm -rf ~/.cache/darktable/mipmaps-6ac0672dac6fe81d5e75505cc7fa15bfeed1acf8.d/ ; ~/darktable-master/bin/darktable-cli bench.SRW bench.SRW.xmp output.pfm --width 0 --height 0 --hq 1 --core -d perf --disable-opencl
0.488711 [dev] took 0.101 secs (0.110 CPU) to load the image.
[_dev_auto_apply_presets] missing mandatory module rawprepare for image 1
[_dev_auto_apply_presets] missing mandatory module colorin for image 1
[_dev_auto_apply_presets] missing mandatory module colorout for image 1
[_dev_auto_apply_presets] missing mandatory module gamma for image 1
0.531560 [export] creating pixelpipe took 0.039 secs (0.386 CPU)
0.535819 [dev_pixelpipe] took 0.004 secs (0.013 CPU) initing base buffer [export]
0.542798 [dev_pixelpipe] took 0.007 secs (0.038 CPU) processed `raw black/white point' on CPU, blended on CPU [export]
0.550401 [dev_pixelpipe] took 0.008 secs (0.065 CPU) processed `white balance' on CPU, blended on CPU [export]
0.592788 [dev_pixelpipe] took 0.042 secs (0.515 CPU) processed `highlight reconstruction' on CPU, blended on CPU [export]
0.682106 [dev_pixelpipe] took 0.089 secs (0.962 CPU) processed `demosaic' on CPU, blended on CPU [export]
0.947667 [dev_pixelpipe] took 0.265 secs (2.104 CPU) processed `tone mapping' on CPU, blended on CPU [export]
1.296137 [dev_pixelpipe] took 0.348 secs (3.047 CPU) processed `lens correction' on CPU, blended on CPU [export]
1.333320 [dev_pixelpipe] took 0.037 secs (0.426 CPU) processed `base curve' on CPU, blended on CPU [export]
1.357736 [dev_pixelpipe] took 0.024 secs (0.280 CPU) processed `input color profile' on CPU, blended on CPU [export]
1.429356 [dev_pixelpipe] took 0.072 secs (0.831 CPU) processed `color reconstruction' on CPU, blended on CPU [export]
3.174414 [dev_pixelpipe] took 1.745 secs (20.621 CPU) processed `astrophoto denoise' on CPU, blended on CPU [export]
3.285510 [dev_pixelpipe] took 0.111 secs (1.312 CPU) processed `global tonemap' on CPU, blended on CPU [export]
3.489421 [dev_pixelpipe] took 0.204 secs (2.201 CPU) processed `shadows and highlights' on CPU, blended on CPU [export]
5.746928 [dev_pixelpipe] took 2.257 secs (25.478 CPU) processed `contrast equalizer' on CPU, blended on CPU [export]
5.807564 [dev_pixelpipe] took 0.061 secs (0.686 CPU) processed `local contrast' on CPU, blended on CPU [export]
5.890651 [dev_pixelpipe] took 0.083 secs (0.983 CPU) processed `color zones' on CPU, blended on CPU [export]
5.926817 [dev_pixelpipe] took 0.036 secs (0.409 CPU) processed `levels' on CPU, blended on CPU [export]
5.966252 [dev_pixelpipe] took 0.039 secs (0.456 CPU) processed `sharpen' on CPU, blended on CPU [export]
5.989733 [dev_pixelpipe] took 0.023 secs (0.268 CPU) processed `color contrast' on CPU, blended on CPU [export]
6.314755 [dev_pixelpipe] took 0.325 secs (3.870 CPU) processed `output color profile' on CPU, blended on CPU [export]
6.314858 [dev_process_export] pixel pipeline processing took 5.783 secs (64.578 CPU)
[export_job] exported to `output.pfm'
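
To compare two runs like these module by module, a small parser helps. The following is a rough sketch of my own, not a darktable tool: the regex is written against the `-d perf` line format shown in the logs above, and the names `parse_perf` and `compare` are made up for illustration. Only two modules are pasted here to keep it short; in practice you would read both complete logs from files.

```python
import re

# Matches e.g.: took 2.064 secs (2.043 CPU) processed `astrophoto denoise' on GPU
# Accepts '.' or ',' decimals and straight or curly closing quotes.
MODULE_RE = re.compile(
    r"took\s+([\d.,]+)\s+secs\s+\(([\d.,]+)\s+CPU\)\s+processed\s+"
    r"[`‘']([^'’]+)['’]\s+on\s+(GPU|CPU)"
)

OPENCL_LOG = """
3.329972 [dev_pixelpipe] took 2.064 secs (2.043 CPU) processed `astrophoto denoise' on GPU, blended on GPU [export]
3.788413 [dev_pixelpipe] took 0.340 secs (0.296 CPU) processed `contrast equalizer' on GPU, blended on GPU [export]
"""

CPU_LOG = """
3.174414 [dev_pixelpipe] took 1.745 secs (20.621 CPU) processed `astrophoto denoise' on CPU, blended on CPU [export]
5.746928 [dev_pixelpipe] took 2.257 secs (25.478 CPU) processed `contrast equalizer' on CPU, blended on CPU [export]
"""


def parse_perf(text):
    """Map module name -> (wall seconds, CPU seconds, device).

    If a module appears more than once, the last instance wins.
    """
    modules = {}
    for m in MODULE_RE.finditer(text):
        wall = float(m.group(1).replace(",", "."))
        cpu = float(m.group(2).replace(",", "."))
        modules[m.group(3)] = (wall, cpu, m.group(4))
    return modules


def compare(opencl, cpu_only):
    """Return (module, OpenCL wall, CPU-only wall, CPU-only / OpenCL) rows."""
    rows = []
    for name, (wall_ocl, _, _) in opencl.items():
        if name in cpu_only and wall_ocl > 0:
            wall_cpu = cpu_only[name][0]
            rows.append((name, wall_ocl, wall_cpu, wall_cpu / wall_ocl))
    return rows


if __name__ == "__main__":
    for name, ocl, cpu, ratio in compare(parse_perf(OPENCL_LOG), parse_perf(CPU_LOG)):
        print(f"{name:20s} OpenCL {ocl:6.3f}s  CPU {cpu:6.3f}s  ratio {ratio:5.2f}")
```

A ratio above 1 means OpenCL was faster for that module, below 1 means the CPU was. On these two runs contrast equalizer gains a lot from the GPU, while astrophoto denoise comes out slightly faster on the CPU.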

It’s interesting to see identical cards (1060/6GB) performing so differently. Release build vs. RelWithDebInfo should not affect the GPU, AFAIK. Worse GPU time and better CPU time for the same module.

Maybe the reason is the CPU? Mine is an i5-4570 (introduced in 2013), yours is a Ryzen 5 5600X (introduced in 2021). The CPU obviously takes a larger share of the computing (i5: 0,500 CPU, R5: 2.043 CPU).
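
To put numbers on that, here is a quick back-of-the-envelope sketch using only the astrophoto denoise figures already posted in this thread. I am assuming the value in parentheses is CPU time summed over all threads, so CPU time divided by wall time gives roughly the number of busy CPU threads; that reading of the log is my assumption, not something confirmed here.

```python
# (wall seconds, CPU seconds) for `astrophoto denoise', copied from the logs above.
RUNS = {
    "i5-4570, OpenCL": (3.868, 0.500),
    "i5-4570, CPU only": (5.675, 21.625),
    "Ryzen 5 5600X, OpenCL": (2.064, 2.043),
    "Ryzen 5 5600X, CPU only": (1.745, 20.621),
}


def busy_threads(wall, cpu):
    """CPU seconds per wall-clock second (assumed: CPU time is summed over threads)."""
    return cpu / wall


for label, (wall, cpu) in RUNS.items():
    print(f"{label:26s} {busy_threads(wall, cpu):5.2f} CPU s per wall s")
```

With OpenCL enabled the ratio is about 0.13 on the i5 and about 0.99 on the Ryzen, so on the Ryzen system roughly one CPU thread is busy for the whole duration of the module, while on the i5 the CPU is mostly idle. Whether that is real work or just waiting on the GPU cannot be told from these logs.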