Slideshow and tiling on a Mac (diffuse and sharpen?): extremely slow

Different demosaic. Perhaps the X-Trans algorithm is leaking memory, or simply needs more?

Not X-Trans though. This was the first Fuji X100, with a Bayer sensor (I miss those colors!)

At least it tries to do tiling, but on my computer without success:

22.583654 [default_process_tiling_cl_ptp] aborted tiling for module 'diffuse'. too many tiles: 5890 x 4018
22.583677 [opencl_pixelpipe] could not run module 'diffuse' on gpu. falling back to cpu path
22.583698 [default_process_tiling_ptp] gave up tiling for module 'diffuse'. too many tiles: 5890 x 4018

Oh, sorry, then it was a bad guess / bad memory on my part. I’ve already shut down the computers.

I think it is simply because, with the host memory limit set to 0 (no limit other than the available memory), the system can load the entire image into memory without needing to tile. But because OpenCL doesn’t have enough memory, it still needs to tile: it creates a huge number of tiles, processes all of them and stores the results before merging them. In the end it is too much, so falling back to the CPU makes sense on your system.

Increasing the headroom forces the system to leave more GPU memory available for other tasks (per the manual), so when diffuse or sharpen tries to use the GPU it notices it doesn’t have enough memory and falls back to the CPU.
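Just to illustrate the interplay, here is a toy sketch (not darktable’s actual tiling code; the image size, per-pixel buffer count and memory figures are all assumptions on my side). It shows why a fixed per-tile padding makes tiling blow up once the usable budget (GPU memory minus headroom) gets too small:

```c
/* Toy illustration only -- not darktable's tiling code. The numbers are
 * made up; the point is that the padding per tile is fixed, so a small
 * memory budget shrinks the useful tile interior until tiling collapses. */
#include <math.h>
#include <stdio.h>

static void plan_tiles(const double budget_mb)
{
  const int img_w = 6240, img_h = 4160;  /* assumed ~26 MP image          */
  const int overlap = 1024;              /* padding required on each side */
  /* assumed per-pixel cost: ~11 RGBA float buffers (scales + residual)   */
  const double bytes_per_px = 11 * 4 * sizeof(float);

  /* largest square tile (interior + 2 * padding) fitting in the budget */
  const double max_px = budget_mb * 1024.0 * 1024.0 / bytes_per_px;
  const int side = (int)sqrt(max_px);
  const int interior = side - 2 * overlap;  /* area each tile contributes */

  if(interior <= 0)
  {
    printf("%6.0f MB: padding eats the whole tile -> give up tiling\n", budget_mb);
    return;
  }
  const int tx = (img_w + interior - 1) / interior;
  const int ty = (img_h + interior - 1) / interior;
  printf("%6.0f MB: %d x %d tiles of %d px\n", budget_mb, tx, ty, side);
}

int main(void)
{
  plan_tiles(4000.0);  /* plenty of memory: a handful of tiles */
  plan_tiles(1100.0);  /* tight budget: many more tiles        */
  plan_tiles(600.0);   /* too little: tiling becomes pointless */
  return 0;
}
```

With a tight budget the fixed 1024 px overlap eats the whole tile, which is roughly what the “too many tiles” abort in the log above is protecting against.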

I would like someone with more knowledge of when/how darktable decides to tile to chime in.


In some cases the compilation of the OpenCL kernels needs several attempts, so try this:

Thanks Martin, I know the linked issue and have a loop running to enable OpenCL support.

I already raised an issue here: https://github.com/darktable-org/darktable/issues/9572

But wasn’t able to solve the problem.

Opened https://github.com/darktable-org/darktable/issues/10910 to track the tiling issue (caused by too low default value for host_memory_limit).

Thank you!

I am not sure this is my case though, as I don’t recall (I am not on that machine now) seeing messages about failed compilation of OpenCL kernels in the terminal.

Yesterday I started to explore the subject and found some discussion. https://github.com/darktable-org/darktable/issues/10884

The pull request from johnny-bit was merged in August and should address this. https://github.com/darktable-org/darktable/pull/9764

I think as more users start to use the diffuse module, the need for available memory is increasing.


Thanks, closed the issue. That’s what I seem to be doing these days: open a feature request, then realise it’s already done. But then why did @gpagnon have the issue? Shouldn’t PR 9764 have taken care of updating the memory limit parameter?

I’m still reading the code changes from that pull request. It seems that the new performance configuration is only used when the version is set to 2.

I think they bumped DT_CURRENT_PERFORMANCE_CONFIGURE_VERSION from 1 to 2; and if darktable detects that the one in darktablerc is old (1), it prompts the user:

Interesting, because I am pretty sure that when I installed the 3.8.0 dmg from the darktable website, in response to that message, I consented to having my old configuration updated by the installation.

You could try setting performance_configuration_version_completed=1 in your darktablerc and retry. :slight_smile:

I tested setting the configuration version to 1. I restarted darktable and it did ask to apply the logic change, but the system selected only 8 GB instead of the 16 I have.

And I see why:

if(mem >= (8lu << 20) && threads > 4 && atom_cores == 0)

The logic is based on memory and CPU cores (threads); otherwise it keeps the setting as before (1500). Assuming mem is expressed in KB, 8lu << 20 corresponds to 8 GB, so the memory check passes, but you also need a CPU with more than 4 threads for this branch to apply, which is not that common (I have 6). The next step in the logic is

if(mem >= (16lu << 20) && threads > 6 && atom_cores == 0)

I do have the memory, but I have 6 cores and not 7, therefore it is not going to follow this path.

The OP has a quad-core CPU, so the logic update did nothing for him, if I understood the code correctly.

Therefore, I think we do need a pull request. I would propose to remove the CPU-core check from this memory logic, since I don’t see how they are connected and it sets the bar too high. Maybe require greater than or equal to 2, but definitely not greater than 4.
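Something along these lines (just a sketch of the idea, not a patch; the 4096/8192 values are placeholders, only the 1500 comes from the thread):

```c
/* Rough sketch of the proposal: derive the memory-related default from
 * installed RAM alone, and keep CPU threads only for quality-vs-speed
 * choices. `mem` is assumed to be in KB, as in the snippets above;
 * the 4096/8192 values are placeholders, not darktable's defaults. */
#include <stdio.h>

static unsigned long suggest_host_memory_limit(const unsigned long mem /* KB */)
{
  if(mem >= (16lu << 20)) return 8192;  /* >= 16 GB RAM */
  if(mem >= (8lu  << 20)) return 4096;  /* >=  8 GB RAM */
  return 1500;                          /* old conservative default */
}

static int prefer_performance(const int threads, const int atom_cores)
{
  /* the algorithm/quality trade-offs can still look at the CPU */
  return threads <= 4 || atom_cores > 0;
}

int main(void)
{
  /* 16 GB of RAM with a 4-thread CPU: the memory limit is raised anyway */
  printf("limit = %lu MB, prefer performance = %d\n",
         suggest_host_memory_limit(16lu << 20), prefer_performance(4, 0));
  return 0;
}
```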


I agree. It sounds silly not to use the memory if it’s available. If you have few cores, and darktable switches to tiling, those few cores will have to work even harder. (Threads may not be the same as cores, because of hyperthreading, but I have not looked into how the number of threads is determined.)

Update
I think they do it like this (checking both installed memory and thread count) because they don’t just tweak memory-usage parameters, but also decide on settings that affect the choice of algorithms: the demosaicer for zoomed-out mode in the darkroom, and ui/performance (‘prefer performance over quality’), which seems to affect:

I think it’d make sense to separate the two:

  • tweak algorithm/quality settings based on CPU and maybe also on memory;
  • tweak purely memory-related settings based on available memory only.

@g-man : are you going to raise a new issue? Or should we resurrect mine (https://github.com/darktable-org/darktable/issues/10910)?

I was going to raise an issue/pull request to modify the code, but I’m in no rush. I started to read the actual code in master to see if there are more changes since that last pull request. I think I noticed some other changes that use the >= 2, so it needs more investigation. I’m currently busy with work.

Background info:

diffuse or sharpen uses a wavelet decomposition with a maximum of 10 scales and needs to store the high-frequency buffers for each scale plus the residual, because the diffusion process works coarse to fine.
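A back-of-the-envelope estimate of what that means in memory, assuming 4-channel float buffers and a ~26 MP image (both assumptions on my side, not figures from the module):

```c
/* Rough estimate only: 10 high-frequency scales + residual, stored as
 * 4-channel float buffers for a ~26 MP image. */
#include <stdio.h>

int main(void)
{
  const unsigned long w = 6240, h = 4160;    /* assumed ~26 MP image */
  const unsigned long scales = 10;           /* high-frequency buffers */
  const unsigned long buffers = scales + 1;  /* plus the residual */
  const unsigned long bytes = w * h * 4 * sizeof(float) * buffers;
  printf("~%.1f GB just for the decomposition buffers\n", bytes / 1073741824.0);
  return 0;
}
```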

Each scale has a blur size equal to 2^scale, so for the last scale, the radius is 1024 px.

For the tiled variant, a tile overlap equal to the blur radius is necessary for numerical consistency with the direct approach, meaning a padding of 1024 px is needed on each side for the largest scale. So the tile size is defined by the mandatory padding first, then the central region is filled as much as possible until RAM is saturated. The problem is that the padding region gets computed 2 to 4 times instead of once.
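To put rough numbers on it (the tile interior size below is an assumption, only the 1024 px padding comes from the above): with an interior of $T = 2048$ px and padding $P = 1024$ px per side, each tile processes

$$\frac{(T + 2P)^2}{T^2} = \frac{4096^2}{2048^2} = 4$$

times the area it actually contributes to the result, because pixels in the overlap strips are recomputed by 2 neighbouring tiles, and by 4 in the corners.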

When the image is downscaled, for example in the preview, the image’s highest frequencies are removed, so the wavelet decomposition discards the first n scales and processes only the last ones. Also, the blur radii are scaled by the zoom factor, so the coarsest scale radius is 1024 px * zoom, and so is the padding, which explains the performance boost.
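As a quick worked example (the zoom factor is my assumption): at a 1:4 preview the coarsest radius becomes

$$1024\ \text{px} \times 0.25 = 256\ \text{px},$$

and since both the padding and the number of pixels shrink with the zoom factor, the tiled work drops roughly with its square.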

Unfortunately, at export time, if you export at 1:1, there is no speed-up for you.

Diffusion is an iterative process and there is no other way. Actually, the wavelet scheme is already a speed-up, because it gets you, in ~32 iterations, results similar to what is achieved in the literature with 100 to 150 iterations.

Also, diffusion is kind of a convolutional neural network and borderline AI. People have been asking for AI shit in dt for years; these are the runtimes you get with such methods.

Magic has a cost.
