Slideshow and tiling on a Mac (diffuse and sharpen?): extremely slow

@gpagnon, would you be OK with sharing the image and your XMP? I’m interested to see how long the export takes with my OpenCL off.

Sure. Should I post it here or send it by email (perhaps easier so I don’t have to go through the licensing stuff)?

Please post it here. I’d also like to test this on my machine. You don’t have to go through any licensing stuff, just say it’s under CC0 or whatever (you don’t even have to allow us to redistribute, or to create derivative works).

Here it is, thanks for having a look!

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

DSCF4570.raf (19.0 MB)
DSCF4570.raf.xmp (9.0 KB)

darktable 3.8.0 from the darktable site. This is without OpenCL, on a laptop: i5-10210U, 16 GB RAM. JPEG export (I just reset the params of the export module, then enabled high quality resampling, though that should not matter, as I was exporting at full size).
general/prefer performance over quality: unchecked

62.267869 [dev] took 0.000 secs (0.000 CPU) to load the image.
62.493615 [export] creating pixelpipe took 0.217 secs (0.734 CPU)
62.497613 [dev_pixelpipe] took 0.003 secs (0.000 CPU) initing base buffer [export]
62.503167 [dev_pixelpipe] took 0.005 secs (0.016 CPU) processed `raw black/white point' on CPU, blended on CPU [export]
62.509267 [dev_pixelpipe] took 0.006 secs (0.000 CPU) processed `white balance' on CPU, blended on CPU [export]
62.515188 [dev_pixelpipe] took 0.005 secs (0.016 CPU) processed `highlight reconstruction' on CPU, blended on CPU [export]
62.637889 [dev_pixelpipe] took 0.122 secs (0.609 CPU) processed `demosaic' on CPU, blended on CPU [export]
68.842915 [dev_pixelpipe] took 6.204 secs (38.109 CPU) processed `denoise (profiled)' on CPU, blended on CPU [export]
69.506373 [dev_pixelpipe] took 0.662 secs (3.688 CPU) processed `lens correction' on CPU, blended on CPU [export]
69.533293 [dev_pixelpipe] took 0.026 secs (0.125 CPU) processed `exposure' on CPU, blended on CPU [export]
69.813110 [dev_pixelpipe] took 0.279 secs (1.547 CPU) processed `tone equalizer' on CPU, blended on CPU [export]
69.982792 [dev_pixelpipe] took 0.169 secs (1.047 CPU) processed `input color profile' on CPU, blended on CPU [export]
image colorspace transform Lab-->RGB took 0.063 secs (0.375 CPU) [channelmixerrgb ]
70.642730 [dev_pixelpipe] took 0.659 secs (3.922 CPU) processed `color calibration' on CPU, blended on CPU [export]
106.555841 [dev_pixelpipe] took 35.912 secs (207.172 CPU) processed `diffuse or sharpen' on CPU, blended on CPU [export]
112.351773 [dev_pixelpipe] took 5.795 secs (39.891 CPU) processed `color balance rgb' on CPU, blended on CPU [export]
113.208068 [dev_pixelpipe] took 0.855 secs (5.859 CPU) processed `filmic rgb' on CPU, blended on CPU [export]
image colorspace transform RGB-->Lab took 0.039 secs (0.188 CPU) [colorout ]
115.218797 [dev_pixelpipe] took 2.010 secs (13.516 CPU) processed `output color profile' on CPU, blended on CPU [export]
115.277924 [dev_pixelpipe] took 0.058 secs (0.328 CPU) processed `display encoding' on CPU, blended on CPU [export]
115.279126 [dev_process_export] pixel pipeline processing took 52.785 secs (315.844 CPU)
[export_job] exported to `C:\Users\whatever/darktable_exported/DSCF4570.jpg'

I’m running DT 3.8 + Windows Insider on Windows 11, on an AMD Ryzen 7 5700G with Radeon Graphics at 3.80 GHz and 16 GB of memory.

Export to JPEG at quality 100, with high quality resampling enabled.

Using the Nvidia RTX 3060 (12 GB, 3584 cores) with OpenCL:
59.223316 [dev_process_export] pixel pipeline processing took 2.411 secs (4.625 CPU)

OpenCL disabled:
42.486889 [dev_process_export] pixel pipeline processing took 16.224 secs (226.844 CPU)
The worst offender was:
40.655046 [dev_pixelpipe] took 11.890 secs (164.188 CPU) processed `diffuse or sharpen' on CPU, blended on CPU [export]

The 3584 cores do make a difference with OpenCL, but even without it this is still 16 seconds to export, not hours. I see Kofa’s run was 53 seconds. It would be interesting to see if someone with a Mac build can do the same experiment (export with your XMP and no OpenCL). My Linux machine is down at the moment, so I can’t test with it.

I noticed diffuse or sharpen was set to 10 iterations. I tried the image with fewer iterations: dropping it to 2, I can’t see a difference at 100%; turning the module off entirely, I can see a difference. The export with iterations at 2 and OpenCL off:
263.606901 [dev_process_export] pixel pipeline processing took 6.614 secs (95.828 CPU)

What are your settings for “host memory limit for tiling” and “minimum amount of memory for a single buffer in tiling”?

This sounds suspicious. To me it reads as if you were processing tiles of roughly 2000 pixels on a side (about 4 MPixels each), with an overlap of 1000 pixels with each neighbour. Your image is subdivided into 47x32 = 1504 tiles, so you process about 1500 * 4 MPx = 6 GigaPixels, while your input is only about 12 MPx. So one tile contributes only 12 MPx / 1500 ~= 8500 px of useful data.
Cross-checking:
Tile size: 2140 pixels on a side
Overlap with neighbours: 1024 pixels on each side, i.e. 2048 pixels per axis
Useful area in the middle of a tile: a square with sides of (2140 - 2048) = 92 pixels, so 8464 useful pixels per tile. This matches the 12 MPx / 1500 ~= 8500 estimate above quite well.
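A quick back-of-the-envelope check of that overhead (the tile geometry above is my reading of the logs, so treat the inputs as estimates rather than values darktable reports directly):

# rough check of the tiling overhead estimated above
tile_side = 2140                   # estimated tile edge in pixels
overlap = 1024                     # overlap with each neighbour
tiles_x, tiles_y = 47, 32          # tile grid seen during the export

useful_side = tile_side - 2 * overlap        # 92 px of new data per side
useful_px = useful_side ** 2                 # 8464 px per tile
total_tiles = tiles_x * tiles_y              # 1504 tiles
processed_px = total_tiles * tile_side ** 2  # pixels actually computed
image_px = 4301 * 2867                       # the real input, ~12.3 MPx

print(useful_px, total_tiles, round(processed_px / image_px))
# -> 8464 useful px per tile, 1504 tiles, ~560x more pixels processed than the image contains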

My settings related to tiling and memory (from the Windows laptop used above, I did not tweak things there much):
maximum_number_tiles=10000
cache_memory=214932504576
host_memory_limit=8078
singlebuffer_limit=32

https://docs.darktable.org/usermanual/3.8/en/special-topics/memory/#setting-up-darktable-on-32-bit-systems

host memory limit (in MB) for tiling host_memory_limit
This parameter tells darktable how much memory (in MB) it should assume is available to store image buffers during module operations. If an image can not be processed within these limits in one chunk, tiling will take over and process the image in several parts, one after the other. Set this to the lowest possible value of 500 as a starting point. You might experiment later whether you can increase it a bit in order to reduce the overhead of tiling.

minimum amount of memory (in MB) for a single buffer in tiling singlebuffer_limit
This is a second parameter that controls tiling. It sets a lower limit for the size of intermediate image buffers in megabytes. The parameter is needed to avoid excessive tiling in some cases (for some modules). Set this parameter to a low value of 8. You might tentatively increase it to 16 later.

Those descriptions above are for the no longer supported 32-bit systems, BTW, so they are extremely conservative.

My bad. I just noticed you already posted this. I suggest you try setting the host memory limit to 0. I think 0 = no limit.

I’m hoping this avoids the need to create tiles at all; my export is not creating them.
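For reference, the setting lives in darktablerc (edit it while darktable is not running); “0 means no limit” is my reading of the behaviour, not something I’ve verified in the source:

host_memory_limit=0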

-d memory could also shed some light on how much memory darktable thinks is available.
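For anyone wanting to collect the same logs, I start darktable from a terminal like this (on Windows the output goes to a darktable-log.txt file rather than the terminal, if I remember correctly):

darktable -d perf -d memory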

This is what I see when I start darktable on this 16 GB machine with the params I posted above:

[memory] at startup
[memory] max address space (vmpeak):        42988 kB
[memory] cur address space (vmsize):        40488 kB
[memory] max used memory   (vmhwm ):        22344 kB
[memory] cur used memory   (vmrss ):        22340 Kb
...
[memory] after successful startup
[memory] max address space (vmpeak):      1099928 kB
[memory] cur address space (vmsize):      1099920 kB
[memory] max used memory   (vmhwm ):       177376 kB
[memory] cur used memory   (vmrss ):       177372 Kb

And when exporting:

14.046108 [export] creating pixelpipe took 0.218 secs (0.672 CPU)
[memory] before pixelpipe process
[memory] max address space (vmpeak):      1562640 kB
[memory] cur address space (vmsize):      1559084 kB
[memory] max used memory   (vmhwm ):       243048 kB
[memory] cur used memory   (vmrss ):       228924 Kb

My results are pretty similar to yours when not using OpenCL, except that it gets stuck forever processing the diffuse or sharpen module:

100,787448 [export] creating pixelpipe took 0,123 secs (0,436 CPU)
100,787506 [pixelpipe_process] [export] using device -1
100,795007 [dev_pixelpipe] took 0,007 secs (0,010 CPU) initing base buffer [export]
100,812557 [dev_pixelpipe] took 0,018 secs (0,089 CPU) processed `raw black/white point' on CPU, blended on CPU [export]
100,819860 [dev_pixelpipe] took 0,007 secs (0,020 CPU) processed `white balance' on CPU, blended on CPU [export]
100,823273 [dev_pixelpipe] took 0,003 secs (0,024 CPU) processed `highlight reconstruction' on CPU, blended on CPU [export]
100,986266 [dev_pixelpipe] took 0,163 secs (0,943 CPU) processed `demosaic' on CPU, blended on CPU [export]
111,357721 [dev_pixelpipe] took 10,371 secs (21,261 CPU) processed `denoise (profiled)' on CPU, blended on CPU [export]
111,837032 [dev_pixelpipe] took 0,479 secs (3,470 CPU) processed `lens correction' on CPU, blended on CPU [export]
111,849873 [dev_pixelpipe] took 0,013 secs (0,100 CPU) processed `exposure' on CPU, blended on CPU [export]
112,010557 [dev_pixelpipe] took 0,161 secs (1,064 CPU) processed `tone equalizer' on CPU, blended on CPU [export]
112,118910 [dev_pixelpipe] took 0,108 secs (0,804 CPU) processed `input color profile' on CPU, blended on CPU [export]
image colorspace transform Lab-->RGB took 0,022 secs (0,152 CPU) [channelmixerrgb ]
112,501322 [dev_pixelpipe] took 0,382 secs (2,935 CPU) processed `color calibration' on CPU, blended on CPU [export]

Thanks, I don’t know much about tiling, but it sounded suspicious to me too…

I’m just trying to export it on my Mac. Problem: on my iMac with the GeForce GT 755M graphics card, the system wasn’t able to compile the OpenCL kernel of diffuse.cl.

Tried to export with OpenCL enabled (except for diffuse); it also takes forever, so I aborted it.
Something strange I’ve discovered: for denoiseprofile it uses 3x1 tiles.

26.593685 [default_process_tiling_cl_ptp] use tiling on module 'denoiseprofile' for image with full size 4295 x 2865
26.593722 [default_process_tiling_cl_ptp] (3 x 1) tiles with max dimensions 2228 x 2865 and overlap 128
26.593728 [default_process_tiling_cl_ptp] tile (0, 0) with 2228 x 2865 at origin [0, 0]
27.361006 [default_process_tiling_cl_ptp] tile (1, 0) with 2228 x 2865 at origin [1972, 0]
27.858873 [default_process_tiling_cl_ptp] tile (2, 0) with 351 x 2865 at origin [3944, 0]

For diffuse it seems to use a different tiling scheme?
29.053169 [default_process_tiling_cl_ptp] aborted tiling for module 'diffuse'. too many tiles: 4301 x 2867

Also tried without OpenCL for the other modules, but that also takes too long and the CPU runs hot.

Success, thanks for the tip!

Setting the host memory limit for tiling to 0, the export completes in 70 seconds without OpenCL:

14,761470 [dev] took 0,051 secs (0,059 CPU) to load the image.
14,896898 [export] creating pixelpipe took 0,122 secs (0,407 CPU)
14,896952 [pixelpipe_process] [export] using device -1
14,904946 [dev_pixelpipe] took 0,007 secs (0,012 CPU) initing base buffer [export]
14,922587 [dev_pixelpipe] took 0,018 secs (0,091 CPU) processed `raw black/white point' on CPU, blended on CPU [export]
14,929971 [dev_pixelpipe] took 0,007 secs (0,020 CPU) processed `white balance' on CPU, blended on CPU [export]
14,933903 [dev_pixelpipe] took 0,004 secs (0,022 CPU) processed `highlight reconstruction' on CPU, blended on CPU [export]
15,098161 [dev_pixelpipe] took 0,164 secs (0,969 CPU) processed `demosaic' on CPU, blended on CPU [export]
25,436375 [dev_pixelpipe] took 10,338 secs (20,704 CPU) processed `denoise (profiled)' on CPU, blended on CPU [export]
25,908296 [dev_pixelpipe] took 0,472 secs (3,465 CPU) processed `lens correction' on CPU, blended on CPU [export]
25,922985 [dev_pixelpipe] took 0,015 secs (0,086 CPU) processed `exposure' on CPU, blended on CPU [export]
26,101326 [dev_pixelpipe] took 0,178 secs (1,023 CPU) processed `tone equalizer' on CPU, blended on CPU [export]
26,226097 [dev_pixelpipe] took 0,125 secs (0,805 CPU) processed `input color profile' on CPU, blended on CPU [export]
image colorspace transform Lab-->RGB took 0,022 secs (0,147 CPU) [channelmixerrgb ]
26,608317 [dev_pixelpipe] took 0,382 secs (2,892 CPU) processed `color calibration' on CPU, blended on CPU [export]
84,104969 [dev_pixelpipe] took 57,497 secs (434,306 CPU) processed `diffuse or sharpen' on CPU, blended on CPU [export]
85,582736 [dev_pixelpipe] took 1,478 secs (11,553 CPU) processed `color balance rgb' on CPU, blended on CPU [export]
85,836541 [dev_pixelpipe] took 0,254 secs (1,870 CPU) processed `filmic rgb' on CPU, blended on CPU [export]
image colorspace transform RGB-->Lab took 0,030 secs (0,222 CPU) [colorout ]
85,908907 [dev_pixelpipe] took 0,072 secs (0,504 CPU) processed `output color profile' on CPU, blended on CPU [export]
85,942115 [dev_pixelpipe] took 0,033 secs (0,240 CPU) processed `display encoding' on CPU, blended on CPU [export]
85,942138 [dev_process_export] pixel pipeline processing took 71,045 secs (478,567 CPU)
[export_job] exported to `/Users/gp/Pictures/EXPORT/acqua/darktable_exported/DSCF4570.jpg'

On the other hand, if I enable OpenCL, it gets stuck as before…


This is what I get (with “host memory limit for tiling = 0”, don’t know if it’s relevant for the data below):

[memory] at startup
[memory] max address space (vmpeak): unknown
[memory] cur address space (vmsize): 34311276 kB
[memory] max used memory (vmhwm ): unknown
[memory] cur used memory (vmrss ): 12804 kB

[memory] after successful startup
[memory] max address space (vmpeak): unknown
[memory] cur address space (vmsize): 34520416 kB
[memory] max used memory (vmhwm ): unknown
[memory] cur used memory (vmrss ): 110928 kB

and when exporting:

[memory] before pixelpipe process
[memory] max address space (vmpeak): unknown
[memory] cur address space (vmsize): 35319900 kB
[memory] max used memory (vmhwm ): unknown
[memory] cur used memory (vmrss ): 299376 kB

I don’t know whether I should be worried about those “unknown” values…

Awesome news. The default of 1500 in DT seems low for most modern systems; I’m not even sure why there is a limit at all. I have my system set to 0.

Now, let’s see if we can improve OpenCL on your system. I am concerned about the 384 MB GPU memory allocation limit that the system reports. I would first try adjusting the OpenCL memory headroom in the config file. Try 800 and then 1200 to see if there is any benefit.
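In case it helps, this is the darktablerc key I mean (the exact name is from memory, so please check it exists in your config before relying on it):

opencl_memory_headroom=800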

Setting host memory limit for tiling to 0

Did the trick for me as well.

154.951228 [default_process_tiling_cl_ptp] aborted tiling for module 'diffuse'. too many tiles: 4301 x 2867
154.951244 [opencl_pixelpipe] could not run module 'diffuse' on gpu. falling back to cpu path
274.280407 [dev_pixelpipe] took 119.368 secs (360.360 CPU) processed `diffuse or sharpen' on CPU, blended on CPU [export]
...
280.792424 [dev_process_export] pixel pipeline processing took 129.130 secs (364.240 CPU)

Tiling seems to be just an OpenCL thing?

It’s still weird, because with 1.5 GB available, I see no reason why a 12 MPixel image would be split into 1500 pieces. However, with your parameters (host_memory_limit=1500, singlebuffer_limit=16) I get even worse performance: 1944 tiles, each with a useful area of only 80x80 pixels:

26.238423 [default_process_tiling_ptp] use tiling on module 'diffuse' for image with full size 4301 x 2867
26.238427 [default_process_tiling_ptp] (54 x 36) tiles with max dimensions 2128 x 2128 and overlap 1024
26.238448 [default_process_tiling_ptp] tile (0, 0) with 2128 x 2128 at origin [0, 0]
28.094228 [lighttable] expose took 0.0000 sec
29.987606 [default_process_tiling_ptp] tile (0, 1) with 2128 x 2128 at origin [0, 80]
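Those numbers are at least self-consistent (a quick check; the step per tile is just the tile size minus twice the overlap):

import math
tile, overlap = 2128, 1024       # from the log above
w, h = 4301, 2867
step = tile - 2 * overlap        # 80 px of useful data per tile
print(step, math.ceil(w / step), math.ceil(h / step))   # 80, 54, 36 -> 54 x 36 = 1944 tiles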

So, resolved. But I’ll open a feature request to increase the defaults.

No, tiling is not ‘an OpenCL thing’.
See darktable 3.8 user manual - memory

Here are my results

with headroom = 800

23,392663 [default_process_tiling_cl_ptp] aborted tiling for module 'diffuse'. too many tiles: 4301 x 2867
23,392690 [opencl_pixelpipe] could not run module 'diffuse' on gpu. falling back to cpu path
83,435404 [dev_pixelpipe] took 60,072 secs (451,092 CPU) processed `diffuse or sharpen' on CPU, blended on CPU [export]
[…]
86,556387 [dev_process_export] pixel pipeline processing took 65,264 secs (455,291 CPU)

with headroom = 1200

17,799217 [default_process_tiling_cl_ptp] aborted tiling for module 'diffuse'. too many tiles: 4301 x 2867
17,799227 [opencl_pixelpipe] could not run module 'diffuse' on gpu. falling back to cpu path
78,031524 [dev_pixelpipe] took 60,234 secs (452,320 CPU) processed `diffuse or sharpen' on CPU, blended on CPU [export]
[…]
80,640850 [dev_process_export] pixel pipeline processing took 65,443 secs (459,138 CPU)

So, although with both settings diffuse or sharpen falls back to being computed on the CPU, the other modules are faster, giving an overall advantage to using OpenCL with headroom = 800 or 1200.

In summary:

  • OpenCL disabled: export time = 71 sec
  • OpenCL enabled, headroom = 400: engages the GPU for diffuse or sharpen: export time = TOO LONG
  • OpenCL enabled, headroom = 800 or 1200: falls back to the CPU for diffuse or sharpen: export time = 65 sec

I gather that the only effect of increasing the headroom here is to disable OpenCL for the diffuse or sharpen module, correct? Note that the integrated graphics card reports:

0.123745 [opencl_init] device 1 'Intel(R) Iris™ Plus Graphics' supports image sizes of 16384 x 16384
0.123748 [opencl_init] device 1 'Intel(R) Iris™ Plus Graphics' allows GPU memory allocations of up to 384MB
[opencl_init] device 1: Intel(R) Iris™ Plus Graphics
CANONICAL_NAME: intelri
GLOBAL_MEM_SIZE: 1536MB

so perhaps increasing the headroom does not leave enough GPU memory for computing tiles?
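If my understanding is right (an assumption on my part: I believe darktable treats roughly GLOBAL_MEM_SIZE minus the headroom as the memory it may use for OpenCL buffers), the budget would look like this:

# rough GPU memory budget; the "global memory minus headroom" model is my
# assumption, not verified in the darktable source; device numbers are taken
# from the opencl_init log above
global_mem_mb = 1536
full_buffer_mb = 4301 * 2867 * 4 * 4 / 2**20   # one full-image float RGBA buffer, ~188 MB

for headroom_mb in (400, 800, 1200):
    usable_mb = global_mem_mb - headroom_mb
    print(headroom_mb, usable_mb, round(usable_mb / full_buffer_mb, 1))
# headroom 400  -> ~1136 MB usable, ~6 full-image buffers: diffuse attempts to run on the GPU
# headroom 1200 -> ~336 MB usable, ~1.8 buffers: not enough, so it falls back to the CPU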