Darktable Performance on macOS, or: Don't Be Stupid Like Me

Oh boy, I feel stupid.

Remember how I said my M1 Mac Mini was so slow in the Hello again, darktable thread? And then I proceeded to spend €2350 on a Mac Studio M2 Max, to speed things up? Well, today I did some performance measurements, to verify that my purchase was worth it and return it if it wasn’t.

To do this, I hooked up the two computers to my two 4K screens, both running the same 5K-ish internal resolution, opened the same picture with the same adjustments, and ran darktable -d perf to log computation times in a nearby terminal window.
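In case anyone wants to replicate this, a minimal sketch of the measurement, assuming darktable is on the PATH (the grep pattern matches the -d perf lines quoted further down in this thread):

# start darktable with performance logging; capture both output streams,
# then watch the pixelpipe timings while dragging sliders in the darkroom
darktable -d perf 2>&1 | grep "pixel pipeline"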

I’ll first change an operation early in the pixelpipe, which should take a long time, then an operation late in the pixelpipe, which should be fast:

Operation                  M1 (s)   M2 Max (s)
Exposure                   1.840    0.574
Color Balance Brilliance   0.316    0.189

This looks like a clear win for the M2 Max! It is indeed much faster, exactly what I wanted!

And yet, while this is a worthwhile improvement, it is not exactly fast in its own right: 0.5 seconds is still a pretty noticeable delay.

While looking at the performance numbers, I noticed that the preview and full image were always calculated sequentially, on the GPU. And I remembered that they were supposed to be calculated in parallel on GPU and CPU.

To test whether this makes a difference, I switched the performance profile from “very fast GPU” to “default”. That indeed changed things… you’ll be the judge:

Operation                  M1 (s)   M2 Max (s)
Exposure                   0.378    0.225
Color Balance Brilliance   0.212    0.128

🤦🤦‍♂️🤦‍♀️

So there you have it. All of you who said that my Mac was strangely slow were right: I had misconfigured my darktable. Why I did so, I don’t know; presumably I had my reasons at the time. Perhaps it was a leftover from an earlier version of darktable, or from my previous computer, which had a much weaker CPU?

The new computer is still noticeably faster in practice. But I’m seriously considering sending it back regardless. The M1 is actually quite usable now.

TL;DR: switching my performance profile from “very fast GPU” to “default” was much more worthwhile than upgrading my computer.
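For reference, the profile can also be changed outside the GUI. As far as I know it lives in darktablerc (edit it only while darktable is not running; the path and value names below are from my installs, so double-check on yours):

# in ~/.config/darktable/darktablerc
# valid values include "default", "multiple GPUs" and "very fast GPU"
opencl_scheduling_profile=default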


You might also tweak the prioritization for the full, preview, thumbnail, export and 2nd preview pipes, following darktable 4.6 user manual - multiple devices.
(That only works with the default setting.)

I will have a look, thank you. But the M2 has only one CPU and one GPU device, as far as I know. The default should therefore be adequate.

I couldn’t stop myself, and compared darktable performance across all of my computers. As a fun aside, this includes a custom-compiled arm64 build for a new Snapdragon-based Windows tablet.

All the computers were connected to the same 4K displays, editing the same photo in the same way, with performance measured via darktable’s -d perf logging.

darktable can run in CPU mode or GPU/OpenCL mode. On the Snapdragon, only CPU mode is currently available (OpenCL crashes).
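A note on the CPU-mode runs: on the machines where OpenCL does work, I believe the simplest way to get CPU-only numbers is to disable OpenCL for a single launch rather than changing the stored config:

# run one session without OpenCL; the preference in darktablerc is untouched
darktable --disable-opencl -d perf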

Machines I have available:

  • Surface Pro 11 with Snapdragon Plus
  • Mac Studio with M2 Max
  • Work Laptop with i7-12700H and NVidia T1200
  • Gaming Desktop with i5-10400 and NVidia RTX 3060

Performance was tested in two modes: one render at screen resolution (as during editing), and one at export resolution (high-quality mode). These are the first and second numbers below, respectively, each given in seconds. Each measurement was repeated multiple times and averaged (a small averaging sketch follows the list). The export render is 24 MP, about 10 times as many pixels as the ~2.5 MP display area on my 4K screen.

  • 0.16s/1.6s for Snapdragon CPU
  • 0.13s/1.3s for M2 CPU
  • 0.08s/0.31s for M2 GPU
  • 0.22s/2s for i7-12700H CPU
  • 0.35s/0.88s for NVidia T1200 GPU
  • 0.3s/3.1s for i5-10400 CPU
  • 0.2s/0.75s for NVidia RTX 3060 GPU
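The promised averaging sketch: it digests a saved -d perf log (assuming the output was redirected to a hypothetical perf.log; the pattern matches the “took X.XXX secs” wording quoted further down in this thread):

# average all pixelpipe timings found in a saved -d perf log
grep -o 'took [0-9.]\+ secs' perf.log |
    awk '{ sum += $2; n++ } END { if (n) printf "%d runs, mean %.3f s\n", n, sum / n }'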

To nobody’s surprise, the M2 Max is the fastest CPU in this bunch. Somewhat surprisingly, it is by far the fastest GPU as well, easily beating the dedicated NVidia GPUs. Very surprisingly, the little Surface tablet CPU was faster than the workstation and gaming machines, CPU or GPU, for editing tasks; only at export resolution did the dedicated GPUs beat it. darktable on the Surface tablet is in fact a joy to use!

Now I have to go back and check, because I was using “very fast GPU” and thought it gave the best times on my PC 🙂

Huh, interesting! I tested it myself now on a Ryzen 3900, RX 7800 XT, 64 GB RAM, Samsung 990 Pro, darktable 4.9.0~git306.c728e031-1+11881.1, Debian testing.

Test script:

#!/bin/bash
# export the same raw NRUNS times and print each run's pixelpipe timing
NRUNS=3
DARKTABLE="darktable-cli setubal.orf setubal.orf.xmp test.jpg --core -d perf"

for i in $(seq $NRUNS); do
    # -d opencl adds OpenCL debug output; stderr noise is discarded
    $DARKTABLE -d opencl 2>/dev/null | grep "pixel pipeline processing" | grep -o -e "took [0-9.]\+ secs"
    # darktable-cli refuses to overwrite an existing output file, so remove it between runs
    rm -f test.jpg
done

(Benchmark files can be found here: GPU benchmarks in darktable)

Very Fast GPU:

took 1.043 secs
took 1.014 secs
took 1.023 secs

default:

took 1.009 secs
took 0.996 secs
took 1.006 secs

Indeed - it is faster…

However, now I ask myself: should this setting actually matter when using darktable-cli? The manual suggests it only concerns the interactive darkroom views:

With this scheduling profile darktable processes the center image view and the preview window on the GPU sequentially

edit: I suspected the observed differences might be accidental, so I updated to 4.9.0~git335.5907696d-1+11894.1 (my previous version had a bug in darktable-cltest) and ran the benchmark 30 times for each profile:
default:

DescribeResult(nobs=30, minmax=(1.003, 1.147), mean=1.0305666666666666, variance=0.0014552885057471283, skewness=2.274781477564424, kurtosis=4.084471326893757)

fast:

DescribeResult(nobs=30, minmax=(0.998, 1.194), mean=1.0295666666666667, variance=0.0017077022988505753, skewness=2.7238815413253428, kurtosis=7.338038494833576)

and for good measure:

>>> from scipy.stats import ttest_ind
>>> ttest_ind(df['default'], df['fast'])
Ttest_indResult(statistic=0.09738939382490723, pvalue=0.9227529364094245)

(can you do a t-test in cases like this?)

Running Windows? Or…?

All tests were run on Windows and macOS. The only Linux box I’m running at the moment is a Steam Deck and various servers.

I think the setting only matters for interactive use, not exports.

Can you share the raw and the sidecar, so interested parties can repeat your test? (The ‘editing’ (darkroom rendering, I assume) test would need a 4K screen for direct comparison.)

It’d also be interesting to know if tiling was involved, and what the resource size setting was.

I see. I tested it interactively now and still cannot see a difference between the speeds.
So I assume my setup is not really comparable to yours…

processing a full-res 5792x3804 image all the way from raw to screen in vkdt in an interactive session is 19.028 ms, out of which the local laplacian filter takes 8 ms on my machine. i don’t think the GPU even clocked up its cores for this. don’t want to hijack the thread, just saying.

there’s something about the scheduling in darktable that is so complicated that it’s pretty much impossible to get good performance out of it (i can be rude with this darktable code because i wrote the original version, so please don’t be offended). i mean we’re talking about a 10x difference in speed while processing only a crop/downsized version of the image, or even starting from a certain module, not the full pipeline.

i’m going to say this is confirmed by the observation above that a dedicated nvidia GPU doesn’t show its potential in the measured perf numbers.


Absolutely:

DSCF7182.RAF (33.4 MB)
DSCF7182.RAF.xmp (12.9 KB)

I ran darktable with -d perf to measure the time for a full pixelpipe after adjusting the exposure slider. Exposure is relatively early in the pixelpipe, so that’s a bit of a “worst-case” edit.

To reproduce my setup exactly, run on a 4K screen with the display scaling one step smaller than the default: in Windows, that’s 175%; in macOS, that’s the 2304 x 1296 resolution (using macOS’s imaginary pixel units that have no reasonable physical meaning). darktable’s UI scale is left at its default.

Those are impressive numbers! I’m looking forward to running vkdt at some point!

FWIW, there is a binary package already available for MSYS2 CLANGARM64. But you potentially benefited from some extra optimizations in your custom build…

My GPU times (NVidia 1060 / 6GB) are consistent with yours: about 0.7 s for the ‘full’ pipeline after an edit, 1.2 s for export (my card is 2 generations behind yours).

However, the CPU times are not. A ‘full’ pipeline in the darkroom took about 0.8 s on the Ryzen 5 5600X, while your i7-12700H was about 4 times as fast. According to Intel i7-12700H vs AMD Ryzen 5 5600X [cpubenchmark.net] by PassMark Software, the performance difference should be ~20%.

Exporting took ~2.1 s, which is rather similar to what you got with the i7-12700H.

My darktable is built from the master branch, self-compiled, so it uses optimisations specific to the machine.

OpenCL times:
As I loaded the image for the 1st time, I got:

1459.5428 pipe finished             CL0 [full]                                  (   0/   0) 2220x1480 scale=0.3558 --> (   0/   0) 2220x1480 scale=0.3558 ID=4455
1459.5429 [dev_process_image] pixel pipeline took 0.607 secs (1.132 CPU) processing `DSCF7182.RAF'

1459.5836 pipe finished             CL0 [thumbnail]                             (   0/   0)  165x 110 scale=0.0264 --> (   0/   0)  165x 110 scale=0.0264 ID=4455
1459.5836 [dev_process_thumbnail] pixel pipeline processing took 0.612 secs (1.370 CPU)

1459.6034 pipe finished             CL0 [full]                                  (   0/   0) 2220x1480 scale=0.3558 --> (   0/   0) 2220x1480 scale=0.3558 ID=4455
1459.6035 [dev_process_image] pixel pipeline took 0.049 secs (0.325 CPU) processing `DSCF7182.RAF'
  
1459.7496 pipe finished             CL0 [preview]                               (   0/   0) 1342x 895 scale=1.0000 --> (   0/   0) 1342x 895 scale=1.0000 ID=4455
1459.7496 [dev_process_image] pixel pipeline took 0.142 secs (0.689 CPU) processing `DSCF7182.RAF'

Export (I’ve reset the export module to its defaults):

  1803.2746 pipe finished             CL0 [export]                                (   0/   0) 6240x4160 scale=1.0000 --> (   0/   0) 6240x4160 scale=1.0000 ID=4455
  1803.2746 [dev_process_export] pixel pipeline processing took 1.222 secs (6.275 CPU)

Dragging exposure by 0.001:

  1994.5935 pipe finished             CL0 [preview]                               (   0/   0) 1342x 895 scale=1.0000 --> (   0/   0) 1342x 895 scale=1.0000 ID=4455
  1994.5935 [dev_process_image] pixel pipeline took 0.149 secs (0.691 CPU) processing `DSCF7182.RAF'

  1995.1515 pipe finished             CL0 [full]                                  (   0/   0) 2220x1480 scale=0.3558 --> (   0/   0) 2220x1480 scale=0.3558 ID=4455
  1995.1515 [dev_process_image] pixel pipeline took 0.693 secs (1.694 CPU) processing `DSCF7182.RAF'

Of those 0.693 s, the Markesteijn demosaic took more than half of the time:

1994.9990 [dev_pixelpipe] took 0.386 secs (0.311 CPU) [full] processed `demosaic' on GPU, blended on GPU

CPU-only, loading the image into the darkroom:

pipe finished             CPU [full]
[dev_process_image] pixel pipeline took 0.810 secs (7.101 CPU) processing `DSCF7182.RAF'

pipe finished             CPU [preview]
[dev_process_image] pixel pipeline took 0.235 secs (1.455 CPU) processing `DSCF7182.RAF'

Exporting:

[dev_process_export] pixel pipeline processing took 2.182 secs (23.505 CPU)

Good to know, thank you!

Seeing that the M2 and the Snapdragon perform so well, whereas nominally faster CPUs do worse, I suspect a different performance metric is the deciding factor:

Memory speed. Both the M2 and the Snapdragon have very fast memory access (100 GB/s and 70 GB/s, respectively). That is a factor of 2-4 above the bandwidth available to the other machines I tested. And the M2, specifically, has “free” CPU-GPU memory transfers thanks to its unified memory, which might explain why the M2 GPU seems so much faster than the traditional GPUs.
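A rough sanity check on that theory, with assumed numbers: darktable processes 4-channel float buffers, so a 24 MP frame is about 24e6 × 4 × 4 B ≈ 384 MB. If each active module reads and writes that buffer once, ten modules already move roughly 7.7 GB; at 100 GB/s that alone takes ~80 ms, the same ballpark as the measured times. So it is at least plausible that the pipeline is bandwidth-bound rather than compute-bound.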


My pipeline ran for 2220 x 1480 → ~3.3 MPx.

I didn’t actually check the image dimensions. Yours may well be correct.

I’ve now tweaked my settings a bit.

Loading into the darkroom: 640 ms for the full pipeline, about 200 ms for the preview.

    11.1734 pipe finished             CL0 [full]                                  (   0/   0) 2670x1780 scale=0.4279 --> (   0/   0) 2670x1780 scale=0.4279 ID=4460

    11.1735 [dev_process_image] pixel pipeline took 0.640 secs (1.372 CPU) processing `DSCF7182.RAF'

    11.4450 pipe finished             CL0 [preview]                               (   0/   0) 1342x 895 scale=1.0000 --> (   0/   0) 1342x 895 scale=1.0000 ID=4460

    11.4451 [dev_process_image] pixel pipeline took 0.193 secs (0.677 CPU) processing `DSCF7182.RAF'

Moving exposure by a small amount, the preview pipeline (1342 x 895) takes 0.1 s (it was previously 150 ms, but that could be measurement error), and the full takes 0.3 s (about 700 ms before):

   409.2297 pipe finished             CL0 [preview]                               (   0/   0) 1342x 895 scale=1.0000 --> (   0/   0) 1342x 895 scale=1.0000 ID=4460

   409.2297 [dev_process_image] pixel pipeline took 0.106 secs (0.605 CPU) processing `DSCF7182.RAF'

   409.4283 pipe finished             CL0 [full]                                  (   0/   0) 2670x1780 scale=0.4279 --> (   0/   0) 2670x1780 scale=0.4279 ID=4460

   409.4283 [dev_process_image] pixel pipeline took 0.293 secs (1.588 CPU) processing `DSCF7182.RAF'

Export took 1.2s:

   493.5784 pipe finished             CL0 [export]                                (   0/   0) 6240x4160 scale=1.0000 --> (   0/   0) 6240x4160 scale=1.0000 ID=4460

   493.5784 [dev_process_export] pixel pipeline processing took 1.193 secs (6.291 CPU)

Edit: with the performance variables reset to their defaults, I get very similar results:

open in darkroom
    14.3599 pipe finished             CL0 [full]                                  (   0/   0) 2670x1780 scale=0.4279 --> (   0/   0) 2670x1780 scale=0.4279 ID=4460
    14.3599 [dev_process_image] pixel pipeline took 0.652 secs (1.327 CPU) processing `DSCF7182.RAF'
    
    
    14.6329 pipe finished             CL0 [preview]                               (   0/   0) 1342x 895 scale=1.0000 --> (   0/   0) 1342x 895 scale=1.0000 ID=4460
    14.6330 [dev_process_image] pixel pipeline took 0.196 secs (0.689 CPU) processing `DSCF7182.RAF'

    
adjust exposure
   164.2801 pipe finished             CL0 [full]                                  (   0/   0) 2670x1780 scale=0.4279 --> (   0/   0) 2670x1780 scale=0.4279 ID=4460
   164.2801 [dev_process_image] pixel pipeline took 0.209 secs (0.998 CPU) processing `DSCF7182.RAF'

   164.3840 pipe finished             CL0 [preview]                               (   0/   0) 1342x 895 scale=1.0000 --> (   0/   0) 1342x 895 scale=1.0000 ID=4460
   164.3840 [dev_process_image] pixel pipeline took 0.302 secs (1.582 CPU) processing `DSCF7182.RAF'

export
   247.7523 pipe finished             CL0 [export]                                (   0/   0) 6240x4160 scale=1.0000 --> (   0/   0) 6240x4160 scale=1.0000 ID=4460
   247.7523 [dev_process_export] pixel pipeline processing took 1.222 secs (6.304 CPU)