Slideshow and tiling on a Mac (diffuse or sharpen?): extremely slow

Thanks, I knew there were problems with Macs and OpenCL, but given that I couldn’t see any error in the terminal, I thought that maybe I had dodged that bullet.

Now, following your advice, I have disabled OpenCL and will let the export finish through the night, hoping it won’t set the laptop on fire while I sleep. ;-) Will report the result in the morning…

This makes me think that a cloud GPU would be a good fit for you, at least for the export part. You would need to pay around 1 euro/hour, but it should allow you to export without getting a new computer.

FYI, I don’t think a cloud GPU is feasible at the moment.

Still, what puzzles me is why diffuse or sharpen renders correctly in the developing section of darktable within acceptable times (even when zooming in to 100%), but when exporting (with or without OpenCL) processing times increase by, I would say, 2–3 orders of magnitude. That doesn’t seem right to me; maybe @anon41087856 can chime in on this issue?

thanks in advance
giuseppe

Diffuse or sharpen is a very performance-hungry module; you can try reducing the number of iterations. But aside from using a GPU there’s no magic that can improve CPU performance…

Export processes the whole image, while a 100% zoom only deals with the visible part of it. So unless your display can show the whole image at 100%, export requires more processing…

My opinion:

far too little.

On Linux (NVIDIA GPU) I had to increase this value to 800 to speed up processing.

I’m not sure about that. For example, my NVIDIA card has 6 GB, but can allocate only 1.5 GB in one chunk (the ‘allows GPU allocations…’ setting). But when used for processing, darktable allocates all the available memory, just not in one operation. See

Of course, that integrated GPU may simply be way too slow for diffuse or sharpen.

The ‘preview pipeline’ is different from the one used for export: it only does a partial rendering, and if you are not zoomed to 100%, it does what export does when you run a scaled-down export with ‘high quality resampling’ disabled. See

Ok, I understand that. But:

  • the original image is only 4448 x 2870 pixels
  • rendering the diffuse or sharpen module while previewing the full image in the darktable developing section on the laptop (screen = 2560x1600 px), with all side panels hidden (TAB), takes around 20 seconds:

145,093031 [dev_pixelpipe] took 21,418 secs (165,919 CPU) processed `diffuse or sharpen' on CPU, blended on CPU [full]

I would expect that exporting may increase that time maybe tenfold but… a thousand-fold?!
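
Spelling out the naive expectation (a rough Python sketch using the figures above; approximating the visible area at 100% zoom by the screen resolution is an assumption on my part):

# Naive scaling estimate (sketch, not darktable's actual logic)
full_px    = 4448 * 2870      # whole image, ~12.8 MPx
visible_px = 2560 * 1600      # at most what a 100% zoom can show on this screen, ~4.1 MPx
preview_s  = 21.4             # time reported above for 'diffuse or sharpen' [full]
ratio = full_px / visible_px  # ~3.1x more pixels to process on export
print(f"pixel ratio: {ratio:.1f}x -> naive expectation ~{preview_s * ratio:.0f} s, not hours")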

If darktable cannot get hold of your GPU for a while (and we see it’s processing on the CPU, device -1), it falls back to the CPU. Maybe it does not do that with the preview. What do you see on the console when you move the sliders? Where is the preview pipeline processed?

With darktable 3.8.1, you’ll get a warning about the fallback.
See https://github.com/darktable-org/darktable/issues/10828 – there will be a warning message logged, and the default value of timeout (before falling back to CPU) will also be higher (but this won’t affect you if it’s already set in your config file)

You could try:

  • setting opencl_scheduling_profile=very fast GPU (either in darktablerc or via the GUI)
  • increasing opencl_mandatory_timeout (the default is 200, which means 1 second; I’ve raised mine to 20000, i.e. 100 s, exactly because of diffuse or sharpen; see the sketch below for the arithmetic). However, such an extreme setting may lead to a complete hang if the driver is broken.
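
For reference, the timeout arithmetic spelled out as a small Python sketch (the 5 ms per unit is inferred from ‘default 200 means 1 second’, so treat it as an assumption rather than a documented constant):

# opencl_mandatory_timeout is a counter; 200 units ~ 1 s implies ~5 ms per unit (assumption)
ms_per_unit = 1000 / 200                 # inferred from "default 200 = 1 second"
for timeout in (200, 20000):             # default value vs. the raised value mentioned above
    wait_s = timeout * ms_per_unit / 1000
    print(f"opencl_mandatory_timeout={timeout} -> waits up to {wait_s:.0f} s before falling back to CPU")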

Thanks @kofa @hannoschwalm @g-man and @MStraeten,

just to clarify: export is impossibly slow with or without OpenCL. With OpenCL active, darktable does not complain, but it reports taking about 20 seconds per tile and needs to compute 47x32 tiles! It still boggles my mind that a module could be so resource-intensive that, on a new, fairly fast laptop, exporting a single image would take (per my simple estimate) more than 8 hours. I feel something is not scaling properly…
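
For completeness, here is that simple estimate spelled out (a rough Python sketch based on the tile count and per-tile time above):

# Back-of-the-envelope export time with OpenCL tiling, from the numbers reported above
tiles         = 47 * 32       # 1504 tiles reported by darktable
secs_per_tile = 20            # roughly 20 s observed per tile
total_s = tiles * secs_per_tile
print(f"{tiles} tiles x {secs_per_tile} s = {total_s} s ~ {total_s / 3600:.1f} h")  # ~8.4 hours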

On the other hand, if I turn off OpenCL, I have no idea of the progress of the export, as the terminal does not report tiling information. At any rate, with OpenCL off it still takes an inordinate (and unknown) amount of time: I started the process last night and in the morning there was no sign of completion.

I wonder if I am the only one seeing this: surely there are other people running darktable on laptops that do not have separate, large-RAM GPUs…

@gpagnon, would you be OK with sharing the image and your xmp? I’m interested to see how long it takes to export with my OpenCL off.

Sure. Should I post it here or send it by email (perhaps easier so I don’t have to go through the licensing stuff)?

Please post it here. I’d also like to test this on my machine. You don’t have to go through any licensing stuff; just say it’s under CC0 or whatever (you don’t even have to allow us to redistribute, or to create derivative works).

Here it is, thanks for having a look!

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

DSCF4570.raf (19.0 MB)
DSCF4570.raf.xmp (9.0 KB)

darktable 3.8.0 from the darktable site. This is without OpenCL, on a laptop. i5-10210U, 16 GB RAM. JPEG export (I just reset the params of the export module, then enabled high quality resampling (but that should not matter, as I was exporting at full size)).
general/prefer performance over quality: unchecked

62.267869 [dev] took 0.000 secs (0.000 CPU) to load the image.
62.493615 [export] creating pixelpipe took 0.217 secs (0.734 CPU)
62.497613 [dev_pixelpipe] took 0.003 secs (0.000 CPU) initing base buffer [export]
62.503167 [dev_pixelpipe] took 0.005 secs (0.016 CPU) processed `raw black/white point' on CPU, blended on CPU [export]
62.509267 [dev_pixelpipe] took 0.006 secs (0.000 CPU) processed `white balance' on CPU, blended on CPU [export]
62.515188 [dev_pixelpipe] took 0.005 secs (0.016 CPU) processed `highlight reconstruction' on CPU, blended on CPU [export]
62.637889 [dev_pixelpipe] took 0.122 secs (0.609 CPU) processed `demosaic' on CPU, blended on CPU [export]
68.842915 [dev_pixelpipe] took 6.204 secs (38.109 CPU) processed `denoise (profiled)' on CPU, blended on CPU [export]
69.506373 [dev_pixelpipe] took 0.662 secs (3.688 CPU) processed `lens correction' on CPU, blended on CPU [export]
69.533293 [dev_pixelpipe] took 0.026 secs (0.125 CPU) processed `exposure' on CPU, blended on CPU [export]
69.813110 [dev_pixelpipe] took 0.279 secs (1.547 CPU) processed `tone equalizer' on CPU, blended on CPU [export]
69.982792 [dev_pixelpipe] took 0.169 secs (1.047 CPU) processed `input color profile' on CPU, blended on CPU [export]
image colorspace transform Lab-->RGB took 0.063 secs (0.375 CPU) [channelmixerrgb ]
70.642730 [dev_pixelpipe] took 0.659 secs (3.922 CPU) processed `color calibration' on CPU, blended on CPU [export]
106.555841 [dev_pixelpipe] took 35.912 secs (207.172 CPU) processed `diffuse or sharpen' on CPU, blended on CPU [export]
112.351773 [dev_pixelpipe] took 5.795 secs (39.891 CPU) processed `color balance rgb' on CPU, blended on CPU [export]
113.208068 [dev_pixelpipe] took 0.855 secs (5.859 CPU) processed `filmic rgb' on CPU, blended on CPU [export]
image colorspace transform RGB-->Lab took 0.039 secs (0.188 CPU) [colorout ]
115.218797 [dev_pixelpipe] took 2.010 secs (13.516 CPU) processed `output color profile' on CPU, blended on CPU [export]
115.277924 [dev_pixelpipe] took 0.058 secs (0.328 CPU) processed `display encoding' on CPU, blended on CPU [export]
115.279126 [dev_process_export] pixel pipeline processing took 52.785 secs (315.844 CPU)
[export_job] exported to `C:\Users\whatever/darktable_exported/DSCF4570.jpg'

I’m running darktable 3.8 on a Windows 11 Insider build. AMD Ryzen 7 5700G with Radeon Graphics at 3.80 GHz and 16 GB of memory.

Export to JPEG at quality 100, with high quality resampling enabled.

Using the NVIDIA RTX 3060 (12 GB, 3584 cores) with OpenCL:
59.223316 [dev_process_export] pixel pipeline processing took 2.411 secs (4.625 CPU)

OpenCL disabled:
42.486889 [dev_process_export] pixel pipeline processing took 16.224 secs (226.844 CPU)
The worst offender was:
40.655046 [dev_pixelpipe] took 11.890 secs (164.188 CPU) processed `diffuse or sharpen' on CPU, blended on CPU [export]

The 3584 cores do make a difference with OpenCL, but even without it this is still 16 s to export, not hours. I see kofa’s run was 53 secs. It would be interesting to see if someone with a Mac build can do the same experiment (export of your xmp with no OpenCL). My Linux machine is down at the moment, so I can’t test with it.

I noticed diffuse or sharpen was set to 10 iterations. I checked the image with fewer iterations: dropping it to 2, I can’t see a difference at 100%; turning the module off entirely, I can see a difference. The export with iterations at 2 and OpenCL off:
263.606901 [dev_process_export] pixel pipeline processing took 6.614 secs (95.828 CPU)

What are your settings for ‘host memory limit for tiling’ and ‘minimum amount of memory for a single buffer in tiling’?

This sounds suspicious. To me it reads as if you were processing tiles of roughly 2000 pixels on a side (4 MPixels), with an overlap of 1000 pixels with each neighbour? And your image is subdivided into 47x32 = 1504 tiles, so you process 1500 * 4 MPx = 6 GigaPixels, while your input is about 12 MPx. So one tile has 12 MPx / 1500 ~= 8500 px of useful data.
Cross-checking:
Tile size: 2140 pixels per side
Overlap with neighbours: 1024 pixels on each side, i.e. 2048 pixels per dimension
Useful area in the middle of a tile: a square with sides of (2140 - 2048) = 92 pixels, so the useful area is 92^2 = 8464 pixels. This matches quite well with the 12 MPx / 1500 ~= 8500 seen above.
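
The same cross-check as a small Python sketch (the tile size and overlap are the values quoted from your log; everything else follows from them):

# Tiling overhead cross-check, using the tile geometry quoted above
image_px  = 4448 * 2870                      # ~12.8 MPx input
tiles     = 47 * 32                          # 1504 tiles
tile_side = 2140                             # pixels per tile side (from the log)
overlap   = 1024                             # overlap with each neighbour, so 2048 px lost per dimension
processed_px = tiles * tile_side ** 2        # ~6.9 GPx actually pushed through the module
useful_side  = tile_side - 2 * overlap       # 92 px of unique data per tile side
print(f"processed: {processed_px / 1e9:.1f} GPx for a {image_px / 1e6:.1f} MPx image")
print(f"useful area per tile: {useful_side}^2 = {useful_side ** 2} px vs {image_px / tiles:.0f} px expected")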

My settings related to tiling and memory (from the Windows laptop used above, I did not tweak things there much):
maximum_number_tiles=10000
cache_memory=214932504576
host_memory_limit=8078
singlebuffer_limit=32

https://docs.darktable.org/usermanual/3.8/en/special-topics/memory/#setting-up-darktable-on-32-bit-systems

host memory limit (in MB) for tiling host_memory_limit
This parameter tells darktable how much memory (in MB) it should assume is available to store image buffers during module operations. If an image can not be processed within these limits in one chunk, tiling will take over and process the image in several parts, one after the other. Set this to the lowest possible value of 500 as a starting point. You might experiment later whether you can increase it a bit in order to reduce the overhead of tiling.

minimum amount of memory (in MB) for a single buffer in tiling singlebuffer_limit
This is a second parameter that controls tiling. It sets a lower limit for the size of intermediate image buffers in megabytes. The parameter is needed to avoid excessive tiling in some cases (for some modules). Set this parameter to a low value of 8. You might tentatively increase it to 16 later.

Those descriptions above are for the no longer supported 32-bit systems, BTW, so they are extremely conservative.
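
To put those limits in context, a quick sketch of the buffer size involved (assuming darktable’s usual 4 x 32-bit-float-per-pixel pipeline buffers; how many such buffers a given module needs at once varies per module and is not covered here):

# Rough size of one full-resolution pipeline buffer (assumption: 4 channels x 4-byte float per pixel)
width, height = 4448, 2870
bytes_per_px  = 4 * 4
buffer_mb = width * height * bytes_per_px / 2 ** 20
print(f"one full-resolution buffer: ~{buffer_mb:.0f} MB")   # ~195 MB
# A memory-hungry module needing several such buffers can easily exceed host_memory_limit=500
# and be forced into tiling; the 8078 MB shown above leaves far more headroom.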

My bad. I just noticed you already posted this. I suggest you try setting the host memory limit to 0. I think 0 = no limit.

I’m hoping this avoids the need to create tiles at all. My export is not creating them.

-d memory could also shed some light on how much memory darktable thinks is available.

This is what I see when I start darktable on this 16 GB machine with the params I posted above:

[memory] at startup
[memory] max address space (vmpeak):        42988 kB
[memory] cur address space (vmsize):        40488 kB
[memory] max used memory   (vmhwm ):        22344 kB
[memory] cur used memory   (vmrss ):        22340 Kb
...
[memory] after successful startup
[memory] max address space (vmpeak):      1099928 kB
[memory] cur address space (vmsize):      1099920 kB
[memory] max used memory   (vmhwm ):       177376 kB
[memory] cur used memory   (vmrss ):       177372 Kb

And when exporting:

14.046108 [export] creating pixelpipe took 0.218 secs (0.672 CPU)
[memory] before pixelpipe process
[memory] max address space (vmpeak):      1562640 kB
[memory] cur address space (vmsize):      1559084 kB
[memory] max used memory   (vmhwm ):       243048 kB
[memory] cur used memory   (vmrss ):       228924 Kb

My results are kind of similar to yours when not using OpenCL, except that it gets stuck forever processing the diffuse or sharpen module:

100,787448 [export] creating pixelpipe took 0,123 secs (0,436 CPU)
100,787506 [pixelpipe_process] [export] using device -1
100,795007 [dev_pixelpipe] took 0,007 secs (0,010 CPU) initing base buffer [export]
100,812557 [dev_pixelpipe] took 0,018 secs (0,089 CPU) processed `raw black/white point' on CPU, blended on CPU [export]
100,819860 [dev_pixelpipe] took 0,007 secs (0,020 CPU) processed `white balance' on CPU, blended on CPU [export]
100,823273 [dev_pixelpipe] took 0,003 secs (0,024 CPU) processed `highlight reconstruction' on CPU, blended on CPU [export]
100,986266 [dev_pixelpipe] took 0,163 secs (0,943 CPU) processed `demosaic' on CPU, blended on CPU [export]
111,357721 [dev_pixelpipe] took 10,371 secs (21,261 CPU) processed `denoise (profiled)' on CPU, blended on CPU [export]
111,837032 [dev_pixelpipe] took 0,479 secs (3,470 CPU) processed `lens correction' on CPU, blended on CPU [export]
111,849873 [dev_pixelpipe] took 0,013 secs (0,100 CPU) processed `exposure' on CPU, blended on CPU [export]
112,010557 [dev_pixelpipe] took 0,161 secs (1,064 CPU) processed `tone equalizer' on CPU, blended on CPU [export]
112,118910 [dev_pixelpipe] took 0,108 secs (0,804 CPU) processed `input color profile' on CPU, blended on CPU [export]
image colorspace transform Lab-->RGB took 0,022 secs (0,152 CPU) [channelmixerrgb ]
112,501322 [dev_pixelpipe] took 0,382 secs (2,935 CPU) processed `color calibration' on CPU, blended on CPU [export]