In Phoronix’s benchmarks of the latest Intel processors, what stood out most to me was that all the AMD processors do markedly worse than Intel in the darktable CPU-only benchmarks, whereas the RawTherapee benchmark produces a ranking much closer to what I expected for a heavily multithreaded workload.
Is this real or spurious? What might be causing AMD underperformance?
I’m not entirely sure what modules the benchmark tests. Perhaps it’s only the default modules?
I’m not really up to date with the exact characteristics of these machines. It is surprising that the 8+8 Intel (24 threads) is faster than the true 16-core AMD with 32 threads. Maybe the OMP SIMD is finally doing something and actually uses AVX-512, which the AMD machines may not support? Or maybe it’s about memory throughput and DDR5?
He was using some of the “standard”(?) benchmark images.
I think the images used are here. Timestamps within the archive are from 2016, so that would exclude the more recent modules.
I think the main point was that the difference is smaller for RT, so it would be good to understand why this happens. Maybe `target_clones` gives Intel more optimized branches than AMD, or it doesn’t pick an optimized branch on AMD at all.
Watching a few videos, it seems like the new i9 is pretty fast, trading wins and losses with a Ryzen 9 5950X, but the really interesting one from a cost point of view seemed to be the new Intel i5: it looks about 30% faster for 10 bucks more than the Ryzen 5 series chips, so it might be a good option for the budget conscious. That RT vs. DT difference is interesting…
I have a Ryzen 7 5800X. I have not done real performance tests, but dt performs really well even without OpenCL; in many cases almost as fast as with OpenCL.
The difference is less significant with RawTherapee.
The only comparison I have is an i7 laptop.
The question is probably really what modules were used for the test.
First of all: “darktable” from a repo’s package, or “darktable” self-compiled? And in both cases, built with -O2 (the RelWithDebInfo build type) or with -O3 (the Release build type)?
So far, target clones are used only in my code for the tone equalizer (in the guided filter code) and some other modules, but it’s not a project-wide thing.
Anyway… benchmarks comparing apples and satellites…
-O2 discards most of the autovectorization opportunities, so for any filter that processes contiguous pixels it can make a noticeable difference (that is, -O3 can really unleash the power of AVX2). I have played with Godbolt to see what assembly code was produced, and the -O2 output is barely vectorized.
Ingo @heckflosse will certainly know much better, but RT has a lot of SSE2-specific code that does not profit from newer compiler optimizations and might no longer be the best fit for current CPU cache and main RAM architectures.
The latest Intels do not have AVX-512 enabled, though. You have to go out of your way to disable the “efficiency” cores and turn AVX-512 on in the BIOS, and not all motherboards expose that option.
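On Linux you can check whether the kernel sees any AVX-512 feature bits at all (this only inspects `/proc/cpuinfo`; an empty result means the extension is unsupported or disabled in firmware):

```shell
# List the distinct AVX-512 feature flags the kernel reports, if any:
grep -o 'avx512[a-z_]*' /proc/cpuinfo | sort -u
# Empty output = no AVX-512 available (e.g. Alder Lake with E-cores enabled,
# or any pre-Zen 4 AMD CPU).
```

This is a quicker sanity check than rebooting into the BIOS, since the benchmark can only use what the kernel exposes.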
It’s not only about AVX extensions. Many algorithms use internal tiling, and the tile size (fixed for performance reasons) was chosen with a certain architecture in mind.
Yes, the cache is large. But for such large caches a different/larger tile size would be better. I have worked on that in detail for the LMMSE and RCD demosaicers, and the impact is large.