darktable and AMD performance in benchmarks

https://www.phoronix.com/scan.php?page=article&item=intel-12600k-12900k&num=9

In Phoronix’s benchmarks of the latest Intel processors, what stood out to me most was that every AMD processor does markedly worse than Intel in the darktable CPU-only benchmarks. The RawTherapee benchmark, by contrast, has a ranking much closer to what I expected for a heavy multithreaded workload.

Is this real or spurious? What might be causing AMD underperformance?

I’m not entirely sure what modules the benchmark tests. Perhaps it’s only the default modules?

Maybe the target_clones stuff was more optimal for Intel than for AMD?
Also, we would need to know how they built the binary used in the benchmark.

I’m not really up to date with the exact characteristics of these machines. It is surprising that the 8+8 Intel (24 threads) is faster than the true 16-core AMD with 32 threads. Maybe the omp simd is finally doing something and actually uses AVX-512, which the AMD machines maybe don’t support? Maybe it’s about memory throughput and DDR5?

He was using some of the “standard”(?) benchmark images.
I think the images used are here. Timestamps within the archive are from 2016, so that would exclude the more recent modules.

assets != binaries

I think the main point was that the difference is smaller for RT, so it would be good to understand why this happens. Maybe target_clones gives Intel more optimized branches than AMD, or it doesn’t pick an optimized branch on AMD.

True. I was more replying to @CarVac’s remark about

Watching a few videos, the new i9 seems pretty fast, trading wins and losses with a Ryzen 9 5950X. The really interesting one from a cost point of view seems to be the new Intel i5: it looks about 30% faster for 10 bucks more than the Ryzen 5 series chips, so it might be a good option for the budget conscious. That RT vs DT difference is interesting…

I have a Ryzen 7 5800X. I have not done real performance tests, but dt performs really well even without OpenCL, in many cases almost as fast as with OpenCL.
The difference is less significant with RawTherapee.
The only comparison I have is an i7 laptop.
The question is probably really what modules were used for the test.

Maybe we shouldn’t assume this is something wrong with darktable, but rather the result of @heckflosse’s optimization wizardry on RawTherapee.

First of all: is this “darktable” from a repo’s package or “darktable” self-compiled? And in both cases, was it built with -O2 (--build RelWithDebInfo) or with -O3 (--build Release)?

So far, target clones are used only in my code for the tone equalizer (in the guided filter code) and some other modules, but it’s not a project-wide thing.
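For readers unfamiliar with target clones, here is a minimal sketch of the mechanism, not darktable’s actual code. It assumes GCC (or a recent Clang) on x86-64 Linux with glibc, where the `target_clones` function attribute makes the compiler emit one copy of the function per listed target and pick one at load time. The `CLONE_TARGETS` macro, the `scale_pixels` function and the chosen target list are all hypothetical.

```c
#include <stddef.h>

/* Assumption: function multiversioning needs ifunc support, so the
 * attribute is only enabled on x86-64 Linux when the compiler knows it.
 * Everywhere else the macro expands to nothing. */
#if defined(__x86_64__) && defined(__linux__) && defined(__has_attribute)
#if __has_attribute(target_clones)
#define CLONE_TARGETS __attribute__((target_clones("default", "sse4.2", "avx2")))
#endif
#endif
#ifndef CLONE_TARGETS
#define CLONE_TARGETS
#endif

/* Hypothetical kernel: multiply every sample in the buffer by a gain.
 * With the attribute active, the compiler builds a default, an SSE4.2
 * and an AVX2 version and dispatches on the CPU found at load time. */
CLONE_TARGETS
void scale_pixels(float *buf, size_t n, float gain)
{
  for(size_t i = 0; i < n; i++)
    buf[i] *= gain;
}
```

Which clone gets selected depends on the features the CPU reports, not on the vendor, so both Intel and AMD chips with AVX2 should land on the same clone in a sketch like this.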

Anyway… benchmarks comparing apples and satellites…

Is the performance dramatically different?

I’ve never noticed significant changes in Filmulator between the two.

-O2 discards most of the autovectorization possibilities, so for any filter that processes contiguous pixels, it can make a noticeable difference (as in -O3 can really unleash the power of AVX2). I have played with Godbolt to see what assembly code was produced and -O2 output is barely vectorized.
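A small loop of the kind described above makes this easy to check on Godbolt or locally. This is only a sketch with a hypothetical `apply_gain` kernel. With GCC, adding `-fopt-info-vec` to the command line reports which loops were vectorized, so the same file can be compared at `-O2` and `-O3` (optionally with `-march=native`). Note that GCC 12 and later enable a limited form of vectorization at `-O2`, so the result depends on the compiler version.

```c
#include <stddef.h>

/* Hypothetical per-pixel kernel: out[i] = in[i] * gain + offset.
 * `restrict` promises that the two buffers do not overlap, which removes
 * one common obstacle to auto-vectorization of contiguous loops. */
void apply_gain(const float *restrict in, float *restrict out,
                size_t n, float gain, float offset)
{
  for(size_t i = 0; i < n; i++)
    out[i] = in[i] * gain + offset;
}
```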

do we set -O3 in the cmake files?

Ingo @heckflosse will certainly know much better, but RT code has a lot of SSE2-specific code that does not profit from late compiler optimisations and might not be the best code for current CPU cache and main RAM architecture any more?

That was my impression too, hence the reference to AVX-512 above.

The latest Intels do not have AVX-512 enabled though. You have to go out of your way to disable the “efficiency” cores and turn AVX-512 on in the BIOS, and not all motherboards expose that feature.
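Whether a given machine actually exposes AVX-512 can be checked at run time. The sketch below assumes GCC or Clang on x86 and uses the compiler built-ins `__builtin_cpu_init` and `__builtin_cpu_supports`. The wrapper names `cpu_has_avx512f` and `cpu_has_avx2` are hypothetical, and on other compilers or architectures they simply report 0.

```c
/* Returns 1 if the running CPU reports AVX-512 Foundation, else 0.
 * Assumption: GCC or Clang on x86; other targets always report 0. */
int cpu_has_avx512f(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
  __builtin_cpu_init();
  return __builtin_cpu_supports("avx512f") ? 1 : 0;
#else
  return 0;
#endif
}

/* Same check for AVX2, which the CPUs discussed in this thread are
 * expected to report. */
int cpu_has_avx2(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
  __builtin_cpu_init();
  return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
  return 0;
#endif
}
```

Printing both values on the benchmark machines would settle whether AVX-512 could have played any role in the results.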

It’s not only about AVX extensions. Many algorithms use internal tiling, and the tile size (it’s fixed, for performance reasons) was chosen with a certain architecture in mind.

That wouldn’t disadvantage AMD though, since it has larger L3 cache sizes than 11th gen Intel (and same as the 12 series).

Yes, the cache is large. But for such large caches a different/larger tile size would be better. I have worked on that in detail for the lmmse and RCD demosaicers. And the impact is large.
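To illustrate what “tile size” means here, the following is a minimal sketch of a tiled loop. It is not the LMMSE or RCD code from RawTherapee; `transpose_tiled` is a hypothetical example that walks an image in `tile` x `tile` blocks so that each block’s working set can stay in cache. The output is identical for every tile size, only the memory access pattern changes, which is why the best value depends on the cache hierarchy of the target CPU.

```c
#include <stddef.h>

/* Transpose a rows x cols row-major image into dst (cols x rows),
 * visiting the data in tile x tile blocks. `tile` is the tuning knob:
 * the result does not depend on it, only the access pattern does. */
void transpose_tiled(const float *src, float *dst,
                     size_t rows, size_t cols, size_t tile)
{
  if(tile == 0) tile = 1; /* guard against an endless loop */

  for(size_t r0 = 0; r0 < rows; r0 += tile)
    for(size_t c0 = 0; c0 < cols; c0 += tile)
    {
      /* Clamp the block to the image border. */
      const size_t r1 = (r0 + tile < rows) ? r0 + tile : rows;
      const size_t c1 = (c0 + tile < cols) ? c0 + tile : cols;

      for(size_t r = r0; r < r1; r++)
        for(size_t c = c0; c < c1; c++)
          dst[c * rows + r] = src[r * cols + c];
    }
}
```

Timing a function like this with several tile sizes on both an Intel and an AMD machine is one way to see how much a fixed tile size can favor one cache layout over another.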