I wonder how your stats change if you enable diffuse or sharpen and denoise (profiled), which are known to be quite resource-hungry. My OpenCL time jumped to 2 seconds, CPU-only to about 6 (‘no AA filter’ preset and the default settings, respectively).
Hi Bastian, just to make sure I am doing the test correctly: I open your image, change the exposure a little, and then export at full resolution.
Well, on my iMac (Intel i7 3.8 GHz 8-core, AMD Radeon Pro 5700, 64 GB memory) I got the following results:
Export default: 0.333 sec
Export Multiple GPUs: 0.338 sec
Export Very fast GPU: 0.359 sec
These results look very close to the ones for the M2 Max. Thus, I think that either I am doing something wrong or I may be saving some $, because I was thinking of going for an M2 Mac. Here are the results for the export default:
Were you exporting at full resolution?
- Changing exposure is not required to measure export performance; that was only for the other measurement:
Performance was tested in two modes, one render at screen resolution (during editing), and one at export resolution (high quality mode). These are the first and second number, respectively, each given in seconds.
- During exporting, only one pipeline is running, so the scheduling profile (default/multiple GPUs/very fast GPU) should play no role; the resources setting may.
You can include `-d pipe` on the command line to also show the image dimensions the pipeline was processing.
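For example, a minimal invocation (assuming a Linux/macOS shell; the flags are the same on Windows, only the executable path differs):

```
# print timing output plus the dimensions each pipeline run processes
darktable -d perf -d pipe
```

That makes it easy to tell a downscaled preview render apart from a full-resolution export run.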
I’ve compared the Radeon Pro 5700 and my NVIDIA GTX 1060 (a much older card). According to the different benchmarks, the performance difference should be between 15 and 40 percent (e.g. GPU Compute 5889 Ops/Sec vs 4321 Ops/Sec (-26.6%) at Radeon Pro 5700 XT vs GeForce GTX 1060 [videocardbenchmark.net] by PassMark Software, or the GTX 1060 being rated at 75% of the performance of the RX Pro 5700 at AMD Radeon Pro 5700 Specs | TechPowerUp GPU Database), but certainly not as large as what you saw (1.2 s vs 0.3 s; unless again it’s unified memory, but it appears to me it’s a discrete card).
I’m mostly interested in editing performance, not export times. Thus I run darktable from the command line with `-d perf`, and click on the exposure slider. In the terminal, you see performance measurements scroll by for that particular exposure change.
The provided image has only a bare minimum of edits applied, so the measurements obtained this way represent a lower bound on what you can expect during real editing.
The timings also scale with the number of pixels that need to be rendered; hence a full-HD measurement is much faster than a 4k render.
Thank you, both of you. I tested again: the export is at full resolution and took 0.334 secs (the graphics card has 16 GB of memory). The timing for the exposure change is 0.101 secs with default and 0.083 secs with very fast GPU.
I’m adding the results in case I am looking at the wrong numbers.
How do you get those logs showing the scaling? Maybe it’s something to do with my config but I get no scaling at all on a 4k screen.
Either way:
[dev_process_image] pixel pipeline took 0.213 secs (0.381 CPU) processing `DSCF7182.RAF'
[dev_process_export] pixel pipeline processing took 0.321 secs (0.627 CPU)
Very Fast GPU:
[dev_process_image] pixel pipeline took 0.138 secs (0.374 CPU) processing `DSCF7182.RAF'
[dev_process_export] pixel pipeline processing took 0.311 secs (0.672 CPU)
Interesting how the very fast GPU profile improves the editing speed but not particularly the export.
Edit: Sorry for the giant necro. Realized a little too late.
Meanwhile, I realised I had turned on the option to use LCMS2 to apply the output profile. That’s a surprisingly expensive operation.
Can you do that on Windows?
Yes
How? I attempted to replicate the Bash way in PowerShell by locating the DT executable and typing
.\darktable.exe -d perf
But that just launches DT as normal, no output in the command prompt. It doesn’t generate a text file somewhere, right?
Most recent version:
Before: C:\Users\<userName>\AppData\Local\Microsoft\Windows\INetCache\darktable
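In case it helps: on Windows the `-d` output goes into a log file instead of the console, which is why the PowerShell window stays silent. A rough sketch for following it live, assuming the older cache-folder location quoted above and the usual darktable-log.txt file name (adjust both for your install/version):

```powershell
# start darktable with timing output (path depends on where your darktable.exe lives)
& "C:\Program Files\darktable\bin\darktable.exe" -d perf

# in a second PowerShell window, follow the log as it grows
Get-Content "$env:LOCALAPPDATA\Microsoft\Windows\INetCache\darktable\darktable-log.txt" -Wait -Tail 50
```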
Oh thank you for the edit. I was like weeeeell I’m not running current master, Documents folder doesn’t contain Darktable.
Ran the raw export on Intel i7-11370H (4C/8T 3.0GHz) with Iris Xe 96 core GPU:
| OpenCL | Time |
|---|---|
| off | 4.248 secs (30.812 CPU) |
| default | 5.233 secs (3.453 CPU) |
| very fast GPU | 5.080 secs (2.797 CPU) |
I don’t understand the 30.812 CPU figure though; it really just took those 4.2 seconds. Also, this seems to mean that OpenCL actually drags me down on this image. Overall, OpenCL seems to make a smaller difference than I originally thought (I used a stopwatch to time the export before).
Although, testing the export on another image with some drawn masks, it is still 2x as fast with OpenCL, so idk, it depends I guess.
Share your logs so we can understand what’s going on.
Here:
opencl_test.txt (5.6 KB)
The slowest thing seems to be demosaicing.
This is the problem: your iGPU does not have enough memory to process the entire image, so darktable has to split it into multiple smaller tiles and then stitch them back together with some overlap. This is called tiling and it results in a longer execution time (see the rough numbers sketched below). Can you increase the memory available to your iGPU?
demosaic on GPU with tiling,
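For a rough sense of scale (assuming this RAF is somewhere around 26 MP; the exact pixel count is a guess and doesn’t change the picture much), darktable’s pipeline works on 4-channel float buffers:

```
26,000,000 px × 4 channels × 4 bytes/channel ≈ 0.4 GB per full-resolution buffer
```

A module needs at least an input and an output buffer plus working memory on the device, so an iGPU that only gets a slice of shared system RAM quickly runs out and darktable falls back to tiling.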
I changed the settings to → Use all device memory (using OpenCL: default) and the time is slightly faster; this is the log:
opencl_test_2.txt (1.8 KB)
Demosaic now doesn’t tile and it’s roughly 0.8 seconds faster.
Overall the time is variable due to thermals of my system, but the extra memory seems to cut about half a second here.
Don’t use “all device memory”. It’s not safe to use. I think we should hide that setting in darktablerc.
Well, I don’t know any other way to give DT more RAM; perhaps I can only free up some.
The manual explains some options. The easiest is to switch to “large” resources.
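If you just want to try it quickly, darktable can temporarily override a config key at launch with `--conf`. A sketch, assuming the resource preference is still stored under the resourcelevel key (check your darktablerc for the exact key name):

```
# run one session with the "large" resource level and performance timing enabled
darktable --conf resourcelevel=large -d perf
```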