Slideshow and tiling on a Mac (diffuse or sharpen?): extremely slow

Hi all,

I just noticed this: if I try to run a slideshow on a single image (using just one image for debugging purposes) where I used the diffuse or sharpen module (preset “add local contrast”), darktable (3.8.0) seems to hang forever.

I ran darktable again from the terminal with “-d opencl -d perf”, and I noticed that the processing is excruciatingly slow because of tiling. Here’s what the terminal says:

733,156404 [default_process_tiling_cl_ptp] use tiling on module ‘diffuse’ for image with full size 4299 x 2865

733,156432 [default_process_tiling_cl_ptp] (47 x 32) tiles with max dimensions 2140 x 2140 and overlap 1024

733,156436 [default_process_tiling_cl_ptp] tile (0, 0) with 2140 x 2140 at origin [0, 0]

754,673257 [default_process_tiling_cl_ptp] tile (0, 1) with 2140 x 2140 at origin [0, 92]

776,086378 [default_process_tiling_cl_ptp] tile (0, 2) with 2140 x 2140 at origin [0, 184]

797,651121 [default_process_tiling_cl_ptp] tile (0, 3) with 2140 x 2140 at origin [0, 276]

at which point I terminate the process because with 20 sec per tile and 47 x 32 tiles, that would amount to around 500 min wait time (for a slideshow of a single image!).
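(Two footnotes for anyone trying to reproduce this. First, the exact invocation: on macOS the darktable binary lives inside the app bundle, so I launched it with something like the line below; the path is from my install and may differ on yours:

/Applications/darktable.app/Contents/MacOS/darktable -d opencl -d perf

Second, if I read the tiling log correctly, with 2140 px tiles and 1024 px of overlap on each side, each tile contributes only a 2140 - 2 x 1024 = 92 px strip of unique output, which is why the tile origins advance in 92 px steps. Covering 4299 x 2865 px therefore takes ceil(4299 / 92) x ceil(2865 / 92) = 47 x 32 = 1504 tiles, and 1504 tiles x ~20 s ≈ 30,000 s ≈ 500 min.)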

In darktable’s preferences, OpenCL is on and I left the processing parameters at their default values (the corresponding darktablerc keys are sketched after the list):

  • host memory limit for tiling: 1500 MB
  • minimum amount of memory (in MB) for a single buffer in tiling: 16
  • OpenCL: on
  • scheduling profile: default
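For reference, I believe these map onto the following keys in darktablerc (key names taken from my own config file; double-check yours before editing):

host_memory_limit=1500
singlebuffer_limit=16
opencl=TRUE
opencl_scheduling_profile=default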

This is what the terminal reports about OpenCL and the GPU:

0.112876 [opencl_init] opencl related configuration options:
0.112893 [opencl_init]
0.112896 [opencl_init] opencl: 1
0.112897 [opencl_init] opencl_scheduling_profile: ‘default’
0.112899 [opencl_init] opencl_library: ‘’
0.112902 [opencl_init] opencl_memory_requirement: 768
0.112904 [opencl_init] opencl_memory_headroom: 400
0.112906 [opencl_init] opencl_device_priority: ‘/!0,///!0,*’
0.112909 [opencl_init] opencl_mandatory_timeout: 200
0.112911 [opencl_init] opencl_size_roundup: 16
0.112913 [opencl_init] opencl_async_pixelpipe: 0
0.112914 [opencl_init] opencl_synch_cache: active module
0.112917 [opencl_init] opencl_number_event_handles: 25
0.112919 [opencl_init] opencl_micro_nap: 1000
0.112920 [opencl_init] opencl_use_pinned_memory: 0
0.112922 [opencl_init] opencl_use_cpu_devices: 0
0.112924 [opencl_init] opencl_avoid_atomics: 0
0.112925 [opencl_init]
0.113040 [opencl_init] found opencl runtime library ‘/System/Library/Frameworks/OpenCL.framework/Versions/Current/OpenCL’
0.113065 [opencl_init] opencl library ‘/System/Library/Frameworks/OpenCL.framework/Versions/Current/OpenCL’ found on your system and loaded
0.113070 [opencl_init] found 1 platform
0.123694 [opencl_init] found 2 devices
0.123730 [opencl_init] discarding CPU device 0 ‘Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz’
0.123745 [opencl_init] device 1 ‘Intel(R) Iris(TM) Plus Graphics’ supports image sizes of 16384 x 16384
0.123748 [opencl_init] device 1 ‘Intel(R) Iris(TM) Plus Graphics’ allows GPU memory allocations of up to 384MB
[opencl_init] device 1: Intel(R) Iris(TM) Plus Graphics
     CANONICAL_NAME:           intelri
     GLOBAL_MEM_SIZE:          1536MB
     MAX_WORK_GROUP_SIZE:      256
     MAX_WORK_ITEM_DIMENSIONS: 3
     MAX_WORK_ITEM_SIZES:      [ 256 256 256 ]
     DRIVER_VERSION:           1.2(Nov 30 2021 21:37:29)
     DEVICE_VERSION:           OpenCL 1.2
0.126281 [opencl_init] options for OpenCL compiler: -w -cl-fast-relaxed-math -DUNKNOWN=1 -I/Applications/darktable.app/Contents/Resources/share/darktable/kernels
0.126601 [opencl_init] compiling program ‘demosaic_ppg.cl’ …
0.126771 [opencl_load_program] loaded cached binary program from file ‘/Users/gp/.cache/darktable/cached_kernels_for_IntelRIrisTMPlusGraphics_12Nov302021213729/demosaic_ppg.cl.bin’ MD5: ‘2eaae229d53a295428553b86397ee691’
0.126777 [opencl_load_program] successfully loaded program from ‘/Applications/darktable.app/Contents/Resources/share/darktable/kernels/demosaic_ppg.cl’ MD5: ‘2eaae229d53a295428553b86397ee691’
0.128074 [opencl_build_program] successfully built program
0.128093 [opencl_build_program] BUILD STATUS: 0

I wonder what the problem is and whether there is a way around it. My system is a 2020 MacBook Pro with a 2 GHz quad-core Intel Core i5, 16 GB of RAM, and integrated Intel Iris Plus Graphics (1536 MB).

Thanks in advance for any advice!
giuseppe

Update: in the text below you’ll see references to export rather than to the slideshow, as I never use the slideshow.

If, during processing, you see “using device -1”, then the CPU is being used, not the GPU. Maybe your GPU is not powerful enough, or it also runs out of memory. You may need to post everything you get between “[export] creating pixelpipe” and “exported to” (the first and last messages generated when exporting an image). Also, when you exit darktable, you get a summary line like:

4824.604367 [opencl_summary_statistics] device 'NVIDIA GeForce GTX 1060 6GB' (0): 240431 out of 240431 events were successful and 0 events lost

Is there a difference between the total number of events and those successful? Are there any lost events reported?
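A quick way to pull these out of a long log, assuming you redirected darktable’s terminal output to a file (dt.log is just an example name):

grep -E 'using device|opencl_summary_statistics' dt.log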

A few things to try: do the same steps but with OpenCL off. Then do the same steps with the diffuse module off. Lastly, do an export of the same image with the module on. Any difference in rendering time? (A command-line sketch for timing these runs follows.)
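If you prefer timing these outside the GUI, something along these lines should work; the app-bundle path is a guess on my part, and photo.raw / out.jpg are placeholders:

# export with OpenCL, with performance logging
/Applications/darktable.app/Contents/MacOS/darktable-cli photo.raw out.jpg --core -d perf -d opencl

# same export with OpenCL forced off
/Applications/darktable.app/Contents/MacOS/darktable-cli photo.raw out.jpg --core --conf opencl=FALSE -d perf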

This part of your log is concerning: 0.123748 [opencl_init] device 1 ‘Intel(R) Iris(TM) Plus Graphics’ allows GPU memory allocations of up to 384MB

384MB is small. Not all of the GPU memory will be available to darktable (the OS still needs to run). Is this a dedicated GPU card, or part of the CPU?

Thanks @kofa, you’re right, the issue is the same with export: in fact, the problem shows up in exactly the same way when I export.

I tried turning off OpenCL, as @g-man suggested (thanks also!). darktable then hangs seemingly forever when exporting the same image, I think because it is trying to do the computations required by diffuse or sharpen on the CPU (no tiling information is printed here). The last lines in the terminal, before I kill the process, are:

38,796794 [pixelpipe_process] [export] using device -1

38,803604 [dev_pixelpipe] took 0,006 secs (0,009 CPU) initing base buffer [export]

38,819044 [dev_pixelpipe] took 0,015 secs (0,088 CPU) processed `raw black/white point’ on CPU, blended on CPU [export]

38,825913 [dev_pixelpipe] took 0,007 secs (0,020 CPU) processed `white balance’ on CPU, blended on CPU [export]

38,829247 [dev_pixelpipe] took 0,003 secs (0,024 CPU) processed `highlight reconstruction’ on CPU, blended on CPU [export]

38,980668 [dev_pixelpipe] took 0,151 secs (0,923 CPU) processed `demosaic’ on CPU, blended on CPU [export]

49,272305 [dev_pixelpipe] took 10,292 secs (20,554 CPU) processed `denoise (profiled)’ on CPU, blended on CPU [export]

49,727749 [dev_pixelpipe] took 0,455 secs (3,397 CPU) processed `lens correction’ on CPU, blended on CPU [export]

49,740745 [dev_pixelpipe] took 0,013 secs (0,101 CPU) processed `exposure’ on CPU, blended on CPU [export]

49,890545 [dev_pixelpipe] took 0,150 secs (1,033 CPU) processed `tone equalizer’ on CPU, blended on CPU [export]

49,999909 [dev_pixelpipe] took 0,109 secs (0,800 CPU) processed `input color profile’ on CPU, blended on CPU [export]

image colorspace transform Lab–>RGB took 0,023 secs (0,153 CPU) [channelmixerrgb ]

50,386696 [dev_pixelpipe] took 0,387 secs (2,948 CPU) processed `color calibration’ on CPU, blended on CPU [export]

Note that ‘color calibration’ is the module immediately preceding ‘diffuse or sharpen’ in the pixelpipe of this image.

So, no dice with or without OpenCL. I understand that the GPU (an integrated device, not a standalone one) has very little memory, but this is still a fairly new MacBook Pro, a reasonably fast machine. Is the “add local contrast” preset of the diffuse or sharpen module so resource-intensive that exporting one 12 MP image would require > 500 min? Also, is it correct that, when using OpenCL, the terminal says it needs to compute 47 x 32 = 1504 tiles?

thanks again
giuseppe

I see darktable won’t use device #0, but it says there’s also a #1. Is that also discarded? Do you get a ‘discarding device 1’ message?

Do you get something like this after the kernels are loaded?

0.452765 [opencl_init] OpenCL successfully initialized.
0.452768 [opencl_init] here are the internal numbers and names of OpenCL devices available to darktable:
0.452769 [opencl_init]          0       'NVIDIA GeForce GTX 1060 6GB'
0.452773 [opencl_init] FINALLY: opencl is AVAILABLE on this system.
0.452774 [opencl_init] initial status of opencl enabled flag is ON.

Just some notes,

a) There have been some threads here about OpenCL on Macs; as far as I can remember, there are serious problems.

b) Intel graphics are pretty slow, so there is likely not much to gain. The tiles as reported seem fine. Graphics memory that small will always lead to heavy tiling and performance problems when using OpenCL.

c) 4 cores and 16 GB are probably the sensible minimum for a module known to be CPU-hungry.

You reported killing darktable. Does the export finish at all if you don’t interfere?

Yes, I do:

0.138636 [opencl_init] OpenCL successfully initialized.

0.138638 [opencl_init] here are the internal numbers and names of OpenCL devices available to darktable:

0.138640 [opencl_init] 0 ‘Intel(R) Iris(TM) Plus Graphics’

0.138642 [opencl_init] FINALLY: opencl is AVAILABLE on this system.

0.138644 [opencl_init] initial status of opencl enabled flag is ON.

My feeling is that the GPU memory is simply too small:

0.123745 [opencl_init] device 1 ‘Intel(R) Iris(TM) Plus Graphics’ supports image sizes of 16384 x 16384
0.123748 [opencl_init] device 1 ‘Intel(R) Iris(TM) Plus Graphics’ allows GPU memory allocations of up to 384MB
[opencl_init] device 1: Intel(R) Iris(TM) Plus Graphics
     CANONICAL_NAME:           intelri
     GLOBAL_MEM_SIZE:          1536MB
     MAX_WORK_GROUP_SIZE:      256
     MAX_WORK_ITEM_DIMENSIONS: 3
     MAX_WORK_ITEM_SIZES:      [ 256 256 256 ]
     DRIVER_VERSION:           1.2(Nov 30 2021 21:37:29)
     DEVICE_VERSION:           OpenCL 1.2

but then, what are my options? Refrain from using the diffuse or sharpen module (which would be a real pity, since it gives such nice results)?

I also tried changing the parameter “opencl_memory_headroom” from its default value of 400 to 0 in darktablerc. This reduced the number of tiles during export from 1504 to 70, but at around 1 min per tile it is still unusable:

73,544735 [default_process_tiling_cl_ptp] use tiling on module ‘diffuse’ for image with full size 4301 x 2867

73,544766 [default_process_tiling_cl_ptp] (10 x 7) tiles with max dimensions 2488 x 2488 and overlap 1024

73,544770 [default_process_tiling_cl_ptp] tile (0, 0) with 2488 x 2488 at origin [0, 0]

133,140329 [default_process_tiling_cl_ptp] tile (0, 1) with 2488 x 2427 at origin [0, 440]

191,798695 [default_process_tiling_cl_ptp] tile (1, 0) with 2488 x 2488 at origin [440, 0]

252,637711 [default_process_tiling_cl_ptp] tile (1, 1) with 2488 x 2427 at origin [440, 440]
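Sanity check on the numbers: the tiles grew to 2488 px, so each now contributes 2488 - 2 x 1024 = 440 px of unique output (matching the 440 px steps in the origins), and ceil(4301 / 440) x ceil(2867 / 440) = 10 x 7 = 70 tiles. For anyone wanting to try the same, the line I edited is the one below; I assume the default config location of ~/.config/darktable/darktablerc, your path may differ:

opencl_memory_headroom=0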

Try to see if you can update your graphics card drivers. I’m not sure if this is possible on a Mac.

One thing to remember with GPUs and the diffuse module (or similar) is that it is not the clock speed (2 GHz) or the memory (1.5 GB) that matters most, but the number of compute units inside the card. An integrated card has fewer processors available to do all the math the module is asking for.

If the module is taking too long, try to drop the number of iterations.

Thanks; not possible, I’m afraid. Hopefully there’ll be some optimization of the code down the line.

The funny thing is that in the darkroom the rendering is quite slow but still usable: it is only during export (and slideshow) that it becomes too slow to be viable.

Thanks, I knew there were problems with Macs and OpenCL, but since I couldn’t see any error in the terminal, I thought maybe I had dodged that bullet.

Now, following your advice, I have disabled OpenCL and will let the export run through the night, hoping that won’t set the laptop on fire while I sleep. ;-) Will report the result in the morning…

This makes me think that a cloud GPU would be a good use case for you, at least for the export part. You would need to pay around 1 euro/hour, but it should allow you to export without getting a new computer.

FYI, I don’t think cloud GPU is feasible at the moment.

Still, what puzzles me is why diffuse or sharpen renders correctly within acceptable times in darktable’s darkroom (even when zooming in to 100%), but when exporting (with or without OpenCL) processing times increase by, I would say, 2-3 orders of magnitude. That doesn’t seem right to me; maybe @anon41087856 can chime in on this issue?

thanks in advance
giuseppe

diffuse or sharpen is a very performance-hungry module; you can just try to reduce the iterations. But aside from using a GPU, there’s no magic that can improve CPU performance…

Export is done on the whole image, while 100% zoom only deals with the visible part of the image. So as long as you don’t have a display able to show the whole image at 100%, export requires more processing…

My opinion: far too little. On Linux (Nvidia GPU) I had to increase this value to 800 to speed up processing.

I’m not sure about that. For example, my Nvidia has 6 GB, but can allocate only 1.5 GB in one chunk (“allows GPU memory allocations…”). But when used for processing, darktable allocates all the available memory, just not in one operation.

Of course, that integrated GPU may simply be way too slow for diffuse or sharpen.

The ‘preview pipeline’ is different from the one used for export: it only does a partial rendering, and if you are not zoomed to 100%, it does what export does when you run a scaled-down export with ‘high quality resampling’ disabled.

Ok, I understand that. But:

  • the original image is only 4448 x 2870 pixels
  • rendering the diffuse or sharpen module while previewing the full image in the darkroom on the laptop (screen = 2560x1600 px, with all side panels hidden via TAB) takes around 20 sec:

145,093031 [dev_pixelpipe] took 21,418 secs (165,919 CPU) processed `diffuse or sharpen’ on CPU, blended on CPU [full]

I would expect exporting to increase that time maybe tenfold but… a thousand-fold?!
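(For scale: the full image is 4448 x 2870 ≈ 12.8 MP, while the visible preview is at most 2560 x 1600 ≈ 4.1 MP, so naive per-pixel scaling would predict roughly a 3x slowdown, not 1000x.)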

If darktable cannot get hold of your GPU for a while, it falls back to the CPU (and we see it’s processing on the CPU, device -1). Maybe it does not do that for the preview. What do you see on the console when you move the sliders: where is the preview pipeline processed?

With darktable 3.8.1, you’ll get a warning about the fallback.
See https://github.com/darktable-org/darktable/issues/10828: a warning message will be logged, and the default value of the timeout (before falling back to the CPU) will also be higher (though this won’t affect you if the value is already set in your config file).

You could try:

  • setting opencl_scheduling_profile=very fast GPU (either in darktablerc, or available in the GUI)
  • increasing opencl_mandatory_timeout (the default is 200, which means 1 second; I’ve raised mine to 20000, i.e. 100 s, exactly because of diffuse or sharpen). However, such an extreme setting may lead to a complete hang if the driver is broken. See the snippet after this list.
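In darktablerc, that would look something like the following (the values are just the ones I use; adjust to taste):

opencl_scheduling_profile=very fast GPU
opencl_mandatory_timeout=20000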

Thanks @kofa @hannoschwalm @g-man and @MStraeten,

just to clarify: export is impossibly slow with or without OpenCL. With OpenCL active, darktable does not complain, but reports taking about 20 sec per tile and needs to compute 47 x 32 tiles! It still boggles my mind that a module could be so resource-intensive that exporting a single image on a new, fairly fast laptop would take (per my simple estimates) > 8 hours. I feel something is not scaling properly…

On the other hand, if I turn off OpenCL, I have no idea of the progress of the export, as the terminal does not report tiling information. At any rate, with OpenCL off it still takes an inordinate (and unknown) amount of time: I started the process last night, and in the morning there was no sign of completion.

I wonder if I am the only one seeing this: surely there are other people running darktable on laptops that do not have separate, large-memory GPUs…