Performance issues

I can PM the screenshot, def have H.264/H.265

But back to Darktable indeed :slight_smile:
Just gave it a test, and seemed perfectly fine. But then hung on me and it didnt appear to write the last 5-6 history step the XMP. Had lost quite a bit of history when I got back in. As well as some modules in my favourite tab :frowning:

Ill start with -d all and see now.

I have used ProtonVPN for a few years on various Linux distros (currently Arch and Fedora) and iOS. It has always been fast and reliable for me. I use it with openVPN client.

1 Like

First I think that the user manual should explain how to debug Darktable. On windows you have to do as follows:

start C:“program files”\darktable\bin\darktable.exe -d perf -d opencl > log.txt 2>&1

the output is written to the hidden(!) file:

C:\Users[username]\AppData\Local\Microsoft\Windows\INetCache\darktable\darktable-log.txt

This is not something you just know!

Now on to the performance issue.

Exporting the example from @KristijanZic took 11,418 secs (15,422 CPU). This is fast and ok in my opinion. But, but the image and edits do not seem typical.

The image below took 85,214 secs (223,359 CPU) even though the edit only consists of a few steps including tone equalizer and denoise. I think this is slow.

I upload the rawfile, xmp and the log.


DSC_2575.NEF (29.9 MB) DSC_2575.NEF.xmp (7.2 KB)
darktable-log.3.txt (4.3 KB)

1 Like

DT is perfectly usable on Windows with OpenCL and everything, at least with nVidia cards. I use it everyday and no major loss of performance with respect to Linux (which I use for development because it compiles faster).
Also a fantastic work on the GUI has been done in 3.2, which is now fully themable, OS and DPI independent, much better than many other multiplatform SW.

Try avoiding tiling (see the links in one of my previous posts). Also, check why contrast equalizer (‘atrous’) was not processed using OpenCL.

60,572380 [opencl_atrous] couldn't enqueue kernel! -4
60,603627 [default_process_tiling_opencl_ptp] couldn't run process_cl() for module 'atrous' in tiling mode: 0
60,603627 [opencl_pixelpipe] could not run module 'atrous' on gpu. falling back to cpu path
97,638092 [dev_pixelpipe] took 39,362 secs (138,781 CPU) processed `contrast equalizer' on CPU with tiling, blended on CPU [export]

Weird. Using darktable 3.3.0+1314~gcdaaee146 (git master), everything is much faster on my 12-year-old machine, Core2 Duo, 4 GB of RAM, NVidia 1060 with 6 GB of RAM, using your NEF and XMP:

35.891082 [dev] took 0.000 secs (0.000 CPU) to load the image.
36.154169 [export] creating pixelpipe took 0.236 secs (0.318 CPU)
36.154255 [pixelpipe_process] [export] using device 0
36.155219 [dev_pixelpipe] took 0.000 secs (0.000 CPU) initing base buffer [export]
36.184672 [dev_pixelpipe] took 0.029 secs (0.048 CPU) processed `raw black/white point' on GPU, blended on GPU [export]
36.188469 [dev_pixelpipe] took 0.004 secs (0.004 CPU) processed `white balance' on GPU, blended on GPU [export]
36.194689 [dev_pixelpipe] took 0.006 secs (0.010 CPU) processed `highlight reconstruction' on GPU, blended on GPU [export]
41.377531 [dev_pixelpipe] took 5.183 secs (8.119 CPU) processed `demosaic' on CPU with tiling, blended on CPU [export]
41.514513 [dev_pixelpipe] took 0.137 secs (0.145 CPU) processed `lens correction' on GPU, blended on GPU [export]
41.517665 [dev_pixelpipe] took 0.003 secs (0.001 CPU) processed `exposure' on GPU, blended on GPU [export]
41.817558 [dev_pixelpipe] took 0.300 secs (0.421 CPU) processed `tone equalizer' on CPU, blended on CPU [export]
41.837743 [dev_pixelpipe] took 0.020 secs (0.018 CPU) processed `input color profile' on GPU, blended on GPU [export]
41.882531 [dev_pixelpipe] took 0.045 secs (0.022 CPU) processed `denoise (non-local means)' on GPU, blended on GPU [export]
41.908088 [dev_pixelpipe] took 0.026 secs (0.000 CPU) processed `contrast equalizer' on GPU, blended on GPU [export]
41.936057 [dev_pixelpipe] took 0.028 secs (0.012 CPU) processed `local contrast' on GPU, blended on GPU [export]
41.942914 [dev_pixelpipe] took 0.007 secs (0.003 CPU) processed `output color profile' on GPU, blended on GPU [export]
42.020912 [dev_pixelpipe] took 0.078 secs (0.086 CPU) processed `dithering' on CPU, blended on CPU [export]
42.059684 [dev_pixelpipe] took 0.039 secs (0.061 CPU) processed `display encoding' on CPU, blended on CPU [export]
42.059903 [opencl_profiling] profiling device 0 ('GeForce GTX 1060 6GB'):
42.059915 [opencl_profiling] spent  0.0519 seconds in [Write Image (from host to device)]
42.059989 [opencl_profiling] spent  0.0016 seconds in rawprepare_1f
42.060059 [opencl_profiling] spent  0.0015 seconds in whitebalance_1f
42.060126 [opencl_profiling] spent  0.0027 seconds in highlights_1f_lch_bayer
42.060204 [opencl_profiling] spent  0.1669 seconds in [Read Image (from device to host)]
42.060272 [opencl_profiling] spent  0.0330 seconds in [Write Buffer (from host to device)]
42.060339 [opencl_profiling] spent  0.0006 seconds in lens_vignette
42.060407 [opencl_profiling] spent  0.0028 seconds in lens_distort_lanczos3
42.060473 [opencl_profiling] spent  0.0004 seconds in exposure
42.060539 [opencl_profiling] spent  0.0005 seconds in colorin_unbound
42.060618 [opencl_profiling] spent  0.0002 seconds in nlmeans_init
42.060683 [opencl_profiling] spent  0.0039 seconds in nlmeans_dist
42.060748 [opencl_profiling] spent  0.0026 seconds in nlmeans_horiz
42.060812 [opencl_profiling] spent  0.0057 seconds in nlmeans_vert
42.060877 [opencl_profiling] spent  0.0089 seconds in nlmeans_accu
42.060942 [opencl_profiling] spent  0.0006 seconds in nlmeans_finish
42.061014 [opencl_profiling] spent  0.0004 seconds in [Copy Image (on device)]
42.061079 [opencl_profiling] spent  0.0109 seconds in eaw_decompose
42.061144 [opencl_profiling] spent  0.0038 seconds in eaw_synthesize
42.061210 [opencl_profiling] spent  0.0004 seconds in pad_input
42.061276 [opencl_profiling] spent  0.0039 seconds in gauss_reduce
42.061339 [opencl_profiling] spent  0.0029 seconds in process_curve
42.061403 [opencl_profiling] spent  0.0043 seconds in laplacian_assemble
42.061467 [opencl_profiling] spent  0.0004 seconds in write_back
42.061531 [opencl_profiling] spent  0.0010 seconds in colorout
42.061594 [opencl_profiling] spent  0.3119 seconds totally in command queue (with 0 events missing)
42.061730 [dev_process_export] pixel pipeline processing took 5.907 secs (8.953 CPU)

The main differences:
You: 39,631271 [dev_pixelpipe] took 21,260 secs (74,266 CPU) processed `tone equalizer’ on CPU, blended on CPU [export]

Me: 41.817558 [dev_pixelpipe] took 0.300 secs (0.421 CPU) processed `tone equalizer’ on CPU, blended on CPU [export]

You: 58,275698 [dev_pixelpipe] took 18,082 secs (2,172 CPU) processed `denoise (non-local means)’ on GPU, blended on GPU [export]

Me: 41.882531 [dev_pixelpipe] took 0.045 secs (0.022 CPU) processed `denoise (non-local means)’ on GPU, blended on GPU [export]

You: 97,638092 [dev_pixelpipe] took 39,362 secs (138,781 CPU) processed `contrast equalizer’ on CPU with tiling, blended on CPU [export]

Me: 41.908088 [dev_pixelpipe] took 0.026 secs (0.000 CPU) processed `contrast equalizer’ on GPU, blended on GPU [export]

Im just curious, should i be happy with this? Its a fairly typical outdoor portrait edit for me.

30020,848272 [dev_pixelpipe] took 0,000 secs (0,000 CPU) initing base buffer [export]
30020,859474 [dev_pixelpipe] took 0,011 secs (0,007 CPU) processed raw black/white point' on GPU, blended on GPU [export] 30020,861906 [dev_pixelpipe] took 0,002 secs (0,001 CPU) processed white balance’ on GPU, blended on GPU [export]
30020,865500 [dev_pixelpipe] took 0,004 secs (0,002 CPU) processed highlight reconstruction' on GPU, blended on GPU [export] 30020,882585 [dev_pixelpipe] took 0,017 secs (0,009 CPU) processed demosaic’ on GPU, blended on GPU [export]
30020,893199 [dev_pixelpipe] took 0,011 secs (0,003 CPU) processed orientation' on GPU, blended on GPU [export] 30020,990140 [dev_pixelpipe] took 0,097 secs (0,249 CPU) processed retouch’ on GPU, blended on GPU [export]
30021,004332 [dev_pixelpipe] took 0,014 secs (0,003 CPU) processed exposure' on GPU, blended on GPU [export] 30021,022175 [dev_pixelpipe] took 0,018 secs (0,007 CPU) processed input color profile’ on GPU, blended on GPU [export]
30021,951867 [dev_pixelpipe] took 0,930 secs (3,346 CPU) processed contrast equalizer' on GPU, blended on GPU [export] 30022,460872 [dev_pixelpipe] took 0,509 secs (0,689 CPU) processed lowpass’ on GPU, blended on GPU [export]
30022,472167 [dev_pixelpipe] took 0,011 secs (0,004 CPU) processed color balance' on GPU, blended on GPU [export] 30022,912286 [dev_pixelpipe] took 0,440 secs (0,739 CPU) processed color balance Eyes’ on GPU, blended on GPU [export]
image colorspace transform Lab–>RGB took 0,026 secs (0,563 CPU) [filmicrgb ]
30023,182908 [dev_pixelpipe] took 0,271 secs (2,725 CPU) processed filmic rgb' on CPU, blended on CPU [export] image colorspace transform RGB-->Lab took 0,004 secs (0,001 GPU) [tonecurve ] 30023,243620 [dev_pixelpipe] took 0,061 secs (0,046 CPU) processed tone curve’ on GPU, blended on GPU [export]
30023,771806 [dev_pixelpipe] took 0,528 secs (0,784 CPU) processed local contrast Eyes' on GPU, blended on GPU [export] 30023,791621 [dev_pixelpipe] took 0,020 secs (0,012 CPU) processed output color profile’ on GPU, blended on GPU [export]
30023,860325 [dev_pixelpipe] took 0,069 secs (0,678 CPU) processed display encoding' on CPU, blended on CPU [export] 30023,860375 [dev_process_export] pixel pipeline processing took 3,018 secs (9,307 CPU) [export_job] exported to /run/media/tore/WORKHORSE/CamDump/20203110 autumn shoot/darktable_exported/DSC_5195.jpg’

Yes, this is a remarkable difference.
Have you any kind of special performance settings on your system or in DT? And how do I check why contrast equalizer (‘atrous’) was not processed using OpenCL?..but this is apparently not the only problem.

In short: I (and maybe others) will appreciate some help and good advice in this matter because I would love to obtain the kind of performance that you experience.

My pc is: Intel® Core™ i7-4510U CPU @ 2.00GHz 2.60 GHz with 8,00 GB RAM

NVIDIA GeForce GTX 850M

Well, I’m copying here parts of my config that might be relevant (though I suspect something else is the problem), for what it’s worth. I do not have any special settings, except for those recommended for 32-bit systems (I run 64 bit Linux, but with only 4 GB of RAM). I have ‘performance mode’ enabled, but that should not affect exports (and even if it did, it should not produce speed-ups in the 100x-1000x range).

$ grep -iE "(opencl)|(perf)|(mem)|(thread)" darktablerc  
cache_memory=536870912
host_memory_limit=500
opencl=TRUE
opencl_async_pixelpipe=false
opencl_avoid_atomics=false
opencl_checksum=2160716749
opencl_device_priority=*/!0,*/*/*/!0,*
opencl_disable_drivers_blacklist=false
opencl_library=
opencl_mandatory_timeout=200
opencl_memory_headroom=400
opencl_memory_requirement=768
opencl_micro_nap=1000
opencl_number_event_handles=25
opencl_scheduling_profile=very fast GPU
opencl_size_roundup=16
opencl_synch_cache=active module
opencl_use_cpu_devices=false
opencl_use_pinned_memory=false
performance_configuration_version_completed=1
plugins/lighttable/preview/max_in_memory_images=4
ui/performance=TRUE
worker_threads=1

Thank you for a quick response.

Where do I put: $ grep -iE “(opencl)|(perf)|(mem)|(thread)” darktablerc

The rest of the settings are in darktablerec. Right?

I have noted the following differences only. I have:

host_memory_limit=1500

opencl_checksum=3601598440

opencl_device_priority=!1,s/s/s/s - s means “star”

opencl_scheduling_profile=default

No ui/performance=TRUE statement

worker_threads=8

Can I just copy all your settings? What about opencl checksum?

The ‘grep’ stuff was just a command to show which values I extracted (‘grep’ is a tool to find text in files). Don’t put it in your config file! :slight_smile:

You have the default host_memory_limit, it should be fine. I only have 4 GB, so I use a smaller number.
Don’t touch the checksum.
opencl_device_priority depends on your machine. You probably have an integrated device as well as a card, right? I’d assume that’s why we have different numbers.
opencl_scheduling_profile - compared to your graphics card, you have a much faster CPU than I do (a relatively modern NVidia 1060, but an ancient Core2 CPU). You should leave it as it is.
ui/performance - I think that’s because I’m on the master (unstable, development) version.
worker_threads – again, I had to reduce that because of my old machine (with 8 concurrent operations, I’d run out of RAM). Anyway, that setting only affects thumbnail generation, as far as I know.

Maybe getting one of the developers here would help? @hanatos, perhaps? (I mentioned you because some people experience 100x - 1000x the module run times than I do on my 12-year-old machine - see above: Performance issues - #33 by kofa. Not sure whom to contact, really.)

Ok, but is that graphics card actually used? Nothing in the information you provide shows that.

And I know that with some other programs, the NVidia needs to be explicitly selected, and you need the drivers from NVidia, not the ones that come through standard windows update (assuming you use windows, though for linux you’d also need the drivers provided by NVidia, and not the open source nouveau driver)

The use or non-use of the graphics card does not explain this:
obe ( Intel® Core™ i7-4510U CPU @ 2.00GHz 2.60 GHz with 8,00 GB RAM): 39,631271 [dev_pixelpipe] took 21,260 secs (74,266 CPU) processed `tone equalizer’ on CPU, blended on CPU [export]

kofa (Intel Core2 Duo @ 2.33 GHz with 4 GB RAM): 41.817558 [dev_pixelpipe] took 0.300 secs (0.421 CPU) processed `tone equalizer’ on CPU, blended on CPU [export]

hard to diagnose from here. also i’m not super deep in dt’s code any more nowadays. it’s clearly wasting a cycle or two between runs of opencl kernels, but the numbers you pointed to seem pathological. did you also compare --disable-opencl runs to rule out that it may be a driver issue of some sort? other than that, is this a debug build? we used to have a mode that would compile in all sorts of expensive paranoid checks that would slow down the pipeline quite significantly. if i was to debug this further, i’d probably run it through perf and see where all the time goes.

I changed the opencl device priority myself in order to try to utilize the NVIDIA card more. You are right, I have integrated graphics.

So the bottom line is: no change to config file.

I use the master (cdaaee). @obe, what version and OS do you use? Did you build it yourself? If yes, how?
@hanatos: it was a CPU code path (tone equalizer has no OpenCL implementation). OpenCL paths are also much slower.

hm, i’ve seen the tone equaliser. but there were others that seemed to indicate GPU usage. this would still mean that there’d be some copying over etc and that the driver gets to do things that may be stupid. i’d still run the --disable-opencl version, just to be sure we’re not chasing ghosts here.

Turning opencl off resulted in this:

123,542625 [dev_process_export] pixel pipeline processing took 106,910 secs (395,641 CPU)

I use windows 10 64 bits version.

and Darktable 3.2.1

Yes, a few seconds to export an image are typical (if very complex edits are involved, especially without a discrete graphics card, even half a minute would not be surprising).