Darktable/OpenCL optimizations for medium/high DPI displays for interactive image editing

ari · March 1, 2019, 5:31pm

Using darktable with a 1440p display, amd rx560 with opencl enabled, and kabylake i7 CPU. Ubuntu 18.10. AMD ROCM opencl driver.

OpenCL provides a massive speedup for image Export (around 500%) but I am not seeing any increase in performance with OpenCL during interactive editing.

Measuring performance metrics, I can see a decrease in pixelpipe processing times with OpenCL enabled being reported on the terminal (-d perf option); however, I don’t ‘feel’ darktable being any snappier when e.g. zooming in with all my fave modules activated [1]. It takes 3-4 seconds for darktable to finish “Working…” before the zoomed-in or zoomed-out image is displayed, with or without OpenCL enabled.

I did notice that if I reduce the size of the darktable window (I call it viewport) then zooming in and out becomes a lot snappier / responsive.
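A rough way to reason about this observation: if the darkroom only has to render the part of the image that is visible at the current zoom, the work should scale roughly with the window area. The sketch below assumes exactly that (cost proportional to pixel count, which is a simplification), and the window sizes are hypothetical examples, not measurements.

```python
def viewport_pixels(width, height):
    """Number of pixels in a viewport of the given size."""
    return width * height


def relative_cost(small, large):
    """Ratio of pixels to render in `small` versus `large`.

    Assumes processing cost is proportional to pixel count,
    which ignores fixed per-module overhead.
    """
    return viewport_pixels(*small) / viewport_pixels(*large)


full_1440p = (2560, 1440)   # hypothetical full-screen viewport
half_window = (1280, 720)   # hypothetical window at half width and height

# Halving both dimensions leaves a quarter of the pixels to process.
print(relative_cost(half_window, full_1440p))
```

Under that assumption, a window at half the width and height needs about a quarter of the per-pixel work, which would fit the much snappier zooming in a smaller window.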

I followed the opencl optimization guidelines here: https://www.darktable.org/usermanual/en/darktable_and_opencl_optimization.html

but still I don’t see any perceptible difference in snappiness for interactive use with opencl activated vs. de-activated.

Wondering: are there any recommended settings that people are using in order to leverage OpenCL more for interactive image editing?

I noticed some ppl lower the screen resolution when working with DT but I’d rather avoid that approach

[1] profiled denoise, local-contrast/laplacian, sharpen, velvia, shadows&highlights, and occasionally a few more, e.g. lens correction, exposure tweaks, color balance &c.

paperdigits · March 1, 2019, 5:38pm

You’ve neglected to tell us which version of darktable and how you installed it: from source, package manager, PPA, flatpak, snap, or something else.

ari · March 1, 2019, 6:40pm

Fair Q! This is DT 2.6 on ubuntu 18.10, installed from DT’s stable PPA. OpenCL via AMD’s ROCM (AMD’s ppa)

asn · March 3, 2019, 9:13am

Did you select very fast GPU in the settings?

ari · March 3, 2019, 6:06pm

I did try it, with no perceptible improvement in ‘snappiness’ when going from ‘fit to screen’ to 100% zoom in darkroom mode, which in Full Screen (1440p) with all typical modules activated takes 4-5 seconds. This is processing Nikon D500 NEFs (21 Mpix)

As I’ve read in the docs, the “Very Fast GPU” profile moves the ‘preview’ window compute to the GPU; however, during zoom-in/zoom-out the preview doesn’t change, so I guess it makes sense that changing profiles from Default to Very Fast GPU doesn’t change ‘snappiness’. I also tried a similar test to this one:

https://www.youtube.com/watch?v=xuuiUhMr-lQ

i.e. measuring the time of applying all modules from scratch, with and without OpenCL and with different profiles. On my system it takes around 2 seconds regardless. I guess that since the CPU is decent (kabylake i7, ddr4 RAM @ ~2.9GHz), the initial module compute time may not be a bottleneck (although the GPU does increase perf for Exports in a massive way).

My question to DT users using OpenCL: with DT working full screen, in darkroom mode, when zooming from “Fit to screen” to 100% on a ‘typical’ RAW with ‘typical’ modules (sharpen, local-contrast, profiled denoise &c) activated, how long does it take DT to finish ‘working…’ in your setup? (And: did you find a way to tune DT to minimize the ‘working…’ time?)

Claes · March 3, 2019, 7:23pm

About 1.3 seconds here.
Ryzen 7 2700X
16 gig RAM
GTX 1050

asn · March 3, 2019, 8:14pm

Check if you have the following in your darktablerc file:

opencl_async_pixelpipe=true
opencl_device_priority=*/!0,*/*/*
opencl_mandatory_timeout=250
opencl_scheduling_profile=very fast GPU
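
A small script can do this check. This is only a sketch: it assumes the file lives at ~/.config/darktable/darktablerc (the usual location on Linux; adjust if yours differs) and that every setting is a plain key=value line. The helper names parse_rc and find_mismatches are made up for illustration. Close darktable before editing the file by hand.

```python
from pathlib import Path

# The four values suggested in this post.
EXPECTED = {
    "opencl_async_pixelpipe": "true",
    "opencl_device_priority": "*/!0,*/*/*",
    "opencl_mandatory_timeout": "250",
    "opencl_scheduling_profile": "very fast GPU",
}


def parse_rc(text):
    """Parse key=value lines into a dict, splitting on the first '=' only."""
    settings = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def find_mismatches(settings, expected=EXPECTED):
    """Return {key: current value or None} for keys that differ from `expected`."""
    return {
        key: settings.get(key)
        for key, wanted in expected.items()
        if settings.get(key) != wanted
    }


if __name__ == "__main__":
    # Assumed default location on Linux.
    rc_path = Path.home() / ".config" / "darktable" / "darktablerc"
    if rc_path.exists():
        mismatches = find_mismatches(parse_rc(rc_path.read_text()))
        for key, current in mismatches.items():
            print(f"{key}: found {current!r}, suggested {EXPECTED[key]!r}")
        if not mismatches:
            print("All four OpenCL settings match.")
    else:
        print(f"No darktablerc found at {rc_path}")
```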

ari · March 3, 2019, 10:13pm

@Claes what settings do you use for profiled denoise? and what is your monitor resolution? (assuming you are using DT at full-screen)

I have been trying different things in active modules and I noticed that by far the largest contributor to the 4-5 sec processing delay after 100% zoom is denoising, and that by changing to Wavelet the delay decreases a lot - under 2 seconds. This is a lot more usable for interactive editing. Hugely better when working at 100% and panning to inspect results.

For de-noising, I was using non-local means (0.25) in my reference profile for this camera and I find that to be roughly equivalent to wavelets 0.14 + some fine tuning of the strength/frequency curve. So: I have changed my reference profile to wavelets

@asn I am using those settings. Doing some more perf measurements, I could confirm that “very fast GPU” produces a consistent marginal improvement (around 5% as measured with -d perf), even if my GPU is nowhere close to high end (rx560 / 65 watts / 2.4 TFlops)

anon41087856 · March 3, 2019, 10:14pm

I know. Sit tight, it’s WIP.

For example, I have found that, when you zoom-scroll on pictures in darkroom, dt triggers a preview recalculation when you begin to scroll and when you finish scrolling. Basically, it reprocesses the picture twice. I have fixed that by adding a time-out, it should be merged soon.
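
The general idea behind such a time-out is often called debouncing: wait until the input has been quiet for a short while, then reprocess once. The sketch below is not the darktable patch, only a generic illustration; the Debouncer class, the count_reprocesses helper, and the 250 ms value are all assumptions chosen for the example. Times are plain integers in milliseconds so the simulation is deterministic.

```python
class Debouncer:
    """Collapse a burst of events into one action after a quiet period."""

    def __init__(self, timeout):
        self.timeout = timeout      # required quiet time, in ms
        self.last_event = None      # timestamp of the most recent event
        self.pending = False        # True while an action is still owed

    def on_event(self, now):
        """Record an input event (e.g. one scroll step) at time `now`."""
        self.last_event = now
        self.pending = True

    def poll(self, now):
        """Return True once per burst, after `timeout` ms without events."""
        if self.pending and now - self.last_event >= self.timeout:
            self.pending = False
            return True
        return False


def count_reprocesses(event_times, poll_times, timeout):
    """Simulate events and periodic polls; count how often we would reprocess."""
    debouncer = Debouncer(timeout)
    events = sorted(event_times)
    fired = 0
    i = 0
    for t in sorted(poll_times):
        # Deliver every event that happened up to this poll.
        while i < len(events) and events[i] <= t:
            debouncer.on_event(events[i])
            i += 1
        if debouncer.poll(t):
            fired += 1
    return fired


# Three scroll steps within 100 ms trigger a single reprocess.
print(count_reprocesses([0, 50, 100], range(0, 600, 50), timeout=250))
```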

Also, even when the left panel is collapsed and the navigation thumbnail hidden, dt recomputes a full pipe just the same for the thumbnail every time you change a setting. That’s 0.5 to 2 s lost. That won’t be difficult to fix, but I think the best would be to just resize the main view.

Finally, dt abuses Cairo painting to draw fake shadows, borders, etc. The thing is, Cairo is single-threaded, and the way shadows were drawn was by overlaying 8 rectangles of increasing sizes and decreasing opacities. In addition to being ugly (real shadows use blur, not linear gradients), the same pixels were painted over 8 times to be finally occluded by an image (histogram or thumbnail), so that was very inefficient. That is fixed too, but not yet merged.
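
To get a feel for the overdraw, here is a back-of-the-envelope sketch. The geometry is an assumption made for illustration (each of the 8 rectangles is filled completely and grows by 1 px per side around a hypothetical 300x200 widget); the real drawing code may differ.

```python
def painted_pixels(width, height, layers):
    """Total pixels written when `layers` full rectangles are stacked.

    Assumption: layer i extends i pixels beyond the widget on every side
    and is filled completely, including the part hidden under the widget.
    """
    return sum((width + 2 * i) * (height + 2 * i) for i in range(1, layers + 1))


def visible_shadow_pixels(width, height, layers):
    """Pixels of the outermost rectangle not covered by the widget itself."""
    outer = (width + 2 * layers) * (height + 2 * layers)
    return outer - width * height


w, h, n = 300, 200, 8  # hypothetical widget size and the 8 layers mentioned above
written = painted_pixels(w, h, n)
visible = visible_shadow_pixels(w, h, n)
print(written, visible, round(written / visible, 1))
```

Under these assumptions, far more pixels get written than ever end up visible as shadow, which is the inefficiency described above.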

OpenCL was quite the bruteforce way to go, in dt, given the amount of possible speed improvements on CPU path. I’m slowly optimizing.

ari · March 3, 2019, 10:57pm

@anon41087856 this sounds great!

During the tests I did notice that scroll-zoom takes a lot longer to produce a result than using the darkroom’s zoom shortcut (alt-1)

To summarize the small changes I’ve done so far to speed things up for interactive editing:

using OpenCL profile = Very Fast GPU
switch profiled-denoise to wavelets
switch local-contrast to Bilateral - this helped a lot with DT 2.6. With the AMD ROCM OpenCL driver, Local Contrast in laplacian mode has a problem and needs to be forced to use CPU instead of GPU - DT’s kernel triggers a bug in the driver [1] (or perhaps it is buggy itself); during perf testing I could see that local-contrast is very compute-intensive and letting it use the GPU makes a difference, hence back to Bilateral as in the old days.
… and let’s wait for the DT2.7 improvements being driven by Aurelien - until then it helps to zoom-in with Alt-1 and then pan using the navigation window rather than scroll-zoom. Here’s a feature request that will help avoid having to pan so often [2]

[1] https://redmine.darktable.org/issues/12423

[2] https://redmine.darktable.org/issues/10779

Claes · March 4, 2019, 8:55am

Morning!

I presume that you have a small typo there (CPU instead of GPU)?
Forgot to tell you that I am running dt 2.7.0 (i.e. the git version).

How about sending me one of your photos + its xmp file with your settings?
That would make it easier for us to compare execution times.

Have fun!
Claes in Lund, Sweden

ari · March 4, 2019, 1:24pm

Yes that’s right, post fixed!
I will benchmark a sample NEF and then upload it together with the xmp file.
Using DT2.7 maybe explains the better numbers you are seeing since at least one of Aurelien’s optimizations was already merged

AxelG · March 9, 2019, 10:37am

Hello ari,

as I said in my other thread (link below): I have the feeling that, as long as you have just one GPU inside, dt seems to load the CPU once a task has been addressed to the GPU, whether or not the GPU is fully loaded.

In my case the GPU ran at around 25% load and the CPU at 100% on all (at that time 6) cores whenever a pixelpipe recalculation was required, e.g. as in your case when scrolling (all the more so with “heavy-duty modules” like denoise profiled or equalizer).

Then when you have 2xGPU things change. Suddenly the CPU load disappeared and things ran way faster.

Meanwhile I have upgraded further and now have an i9 9900k. Still, I can find scenarios where things (white balance) could be snappier while both GPUs are not yet running at 100%. So OpenCL still leaves us room for improvement.

To me it looks like it is very closely linked to my thread here: (besides the topic about the beer )

I would be a happy tester, if needed

ari · June 28, 2019, 4:27pm

Wondering if DT2.7 already includes the zoom-scroll and Cairo painting optimizations? Browsing the commits and merges done over the last few months, I see tons of great stuff coming down the pipe, but I am unable to recognize these.

I am timing the move of my main workstation to 2.7 and these improvements in day to day usability would quickly tip the balance :). Thanks @anon41087856 and the many other devs for the great work that keeps coming in