I recently built a new desktop computer with an AMD Ryzen 9 3900X 12-core processor and 16GB of RAM. It wasn’t that I was dissatisfied with the 4-core Phenom II, but I wanted to mess a bit with processing scaling, i.e., how the returns diminish as more cores are thrown at the task. So, after a harrowing experience with thermal paste and a mis-seated memory module, I now have a blisteringly fast desktop.
Building a scaling test turned out to be rather simple. When I incorporated OpenMP multithreading in rawproc and img, I configured all the image processing operations to parallelize by default across the maximum number of available threads, and I included logging and messaging to report the thread count. So, I wrote a simple img processing chain to take a raw file to linear TIFF and wrapped it in a for loop that did an export OMP_NUM_THREADS=$i before each img invocation; about 15 minutes of coding. To put significant processing in the mix, I grepped for the denoise message where the nlmeans algorithm is selected, and then ran that script with three raws: 1) a 16MP image from my D7000, 2) a 24MP image from my Z 6, and 3) a 47MP image from a D850 raw downloaded from DPReview. In that order, here’s the output:
glenn@bena:~/ImageStuff/corescaling$ ./proctest.sh DSG_3111.NEF
denoise:nlmeans... (2 threads, 16.585763sec)
denoise:nlmeans... (4 threads, 8.473911sec)
denoise:nlmeans... (6 threads, 5.770318sec)
denoise:nlmeans... (8 threads, 4.381469sec)
denoise:nlmeans... (10 threads, 3.549655sec)
denoise:nlmeans... (12 threads, 3.009535sec)
denoise:nlmeans... (14 threads, 3.170146sec)
denoise:nlmeans... (16 threads, 2.775785sec)
denoise:nlmeans... (18 threads, 3.286111sec)
denoise:nlmeans... (20 threads, 2.379996sec)
denoise:nlmeans... (22 threads, 2.217976sec)
denoise:nlmeans... (24 threads, 2.019165sec)
glenn@bena:~/ImageStuff/corescaling$ ./proctest.sh DSZ_4168.NEF
denoise:nlmeans... (2 threads, 24.825491sec)
denoise:nlmeans... (4 threads, 12.896952sec)
denoise:nlmeans... (6 threads, 8.788920sec)
denoise:nlmeans... (8 threads, 6.591874sec)
denoise:nlmeans... (10 threads, 5.367246sec)
denoise:nlmeans... (12 threads, 4.556160sec)
denoise:nlmeans... (14 threads, 4.809256sec)
denoise:nlmeans... (16 threads, 4.278626sec)
denoise:nlmeans... (18 threads, 3.939623sec)
denoise:nlmeans... (20 threads, 3.554366sec)
denoise:nlmeans... (22 threads, 3.314698sec)
denoise:nlmeans... (24 threads, 3.036885sec)
glenn@bena:~/ImageStuff/corescaling$ ./proctest.sh DSC_0312.NEF
denoise:nlmeans... (2 threads, 46.687434sec)
denoise:nlmeans... (4 threads, 23.874114sec)
denoise:nlmeans... (6 threads, 16.110012sec)
denoise:nlmeans... (8 threads, 12.475774sec)
denoise:nlmeans... (10 threads, 10.004887sec)
denoise:nlmeans... (12 threads, 8.359038sec)
denoise:nlmeans... (14 threads, 8.882417sec)
denoise:nlmeans... (16 threads, 7.795642sec)
denoise:nlmeans... (18 threads, 7.205934sec)
denoise:nlmeans... (20 threads, 6.613540sec)
denoise:nlmeans... (22 threads, 6.146585sec)
denoise:nlmeans... (24 threads, 5.675327sec)
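For anyone wanting to repeat the experiment, the loop described above can be sketched roughly as follows. This is a reconstruction, not the actual proctest.sh: the function name scaling_run and its FIRST, STEP and LAST arguments are made up for illustration, and the command to run is passed in as arguments rather than spelled out, since the exact img command line isn't shown here. The only parts taken from the post are the export OMP_NUM_THREADS=$i before each run and the grep for the denoise message.

```shell
#!/bin/sh
# Hypothetical sketch of a core-scaling loop; not the original proctest.sh.

# scaling_run FIRST STEP LAST COMMAND [ARGS...]
# Runs COMMAND once for each thread count FIRST, FIRST+STEP, ... up to LAST,
# exporting OMP_NUM_THREADS before each run, and prints only the
# denoise:nlmeans log lines from the command's output.
scaling_run() {
    sr_i=$1
    sr_step=$2
    sr_last=$3
    shift 3
    while [ "$sr_i" -le "$sr_last" ]; do
        # OpenMP runtimes use this variable as the default thread count
        # for parallel regions.
        export OMP_NUM_THREADS="$sr_i"
        # Merge stderr into stdout in case the log goes there, then keep
        # only the line that reports the nlmeans timing.
        "$@" 2>&1 | grep 'denoise:nlmeans'
        sr_i=$((sr_i + sr_step))
    done
}
```

Called as scaling_run 2 2 24 followed by the img command line for the raw-to-TIFF chain, this would step through the even thread counts shown above; scaling_run 2 1 24 would cover every count from 2 to 24.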
Of note: the nlmeans algorithm I implemented is a code conversion of a G’MIC implementation presented by @David_Tschumperle in his blog. It’s quite inefficient, but I’ve been too lazy to figure out one of the published optimizations.
I haven’t done any quantitative analysis on this, just some eyeball consideration. The main observation I think is pertinent is that image size matters when selecting a suitable processor. If one stays in the 24MP range, a 6- or 8-core CPU might be quite enough; if one has or aspires to a 40+MP camera, more is probably better.
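For anyone who wants numbers rather than an eyeball estimate, the log lines are easy to post-process. The sketch below is a small awk filter wrapped in a shell function (speedup_table is a made-up name) that reads lines in the format shown above and reports each run's speedup relative to the first line it sees, plus a parallel efficiency figure, i.e. the speedup divided by the increase in thread count. It assumes the first line is the baseline and that the line format is exactly as printed here.

```shell
#!/bin/sh
# Hypothetical helper: turn denoise log lines into speedup figures.

# speedup_table
# Reads lines like "denoise:nlmeans... (2 threads, 16.585763sec)" on stdin.
# The first matching line is taken as the baseline.
speedup_table() {
    awk '
        /threads,/ {
            n = substr($2, 2) + 0   # "(2" -> 2
            t = $4 + 0              # "16.585763sec)" -> 16.585763
            if (n <= 0 || t <= 0) next          # skip lines that did not parse
            if (!have_base) { base_n = n; base_t = t; have_base = 1 }
            s = base_t / t                      # speedup over the baseline run
            e = 100 * s * base_n / n            # percent of ideal linear scaling
            printf "%d threads: %.2fx speedup, %.0f%% efficiency\n", n, s, e
        }
    '
}
```

Feeding it the first table above gives about 1.96x at 4 threads and about 8.2x at 24 threads relative to the 2-thread run, the latter being roughly 68% of the ideal 12x.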
Surprising to me was the incremental speedup between thread counts. Threads here are a logical construct that the CPU schedules onto its physical cores: my CPU has 12 cores and 24 hardware threads, which means that in a 24-thread parallel campaign, each physical core is running two threads. I really expected to see a stair-step pattern in the speedup curve, but apart from a dip just past 12 threads, there was a noticeable speedup at each increment:
glenn@bena:~/ImageStuff/corescaling$ ./proctest.sh DSG_3111.NEF
denoise:nlmeans... (2 threads, 16.505298sec)
denoise:nlmeans... (3 threads, 11.212582sec)
denoise:nlmeans... (4 threads, 8.533492sec)
denoise:nlmeans... (5 threads, 6.808916sec)
denoise:nlmeans... (6 threads, 5.801307sec)
denoise:nlmeans... (7 threads, 4.961288sec)
denoise:nlmeans... (8 threads, 4.428086sec)
denoise:nlmeans... (9 threads, 3.971619sec)
denoise:nlmeans... (10 threads, 3.571562sec)
denoise:nlmeans... (11 threads, 3.280610sec)
denoise:nlmeans... (12 threads, 3.000414sec)
denoise:nlmeans... (13 threads, 3.415608sec)
denoise:nlmeans... (14 threads, 3.167845sec)
denoise:nlmeans... (15 threads, 2.993138sec)
denoise:nlmeans... (16 threads, 2.784979sec)
denoise:nlmeans... (17 threads, 2.677216sec)
denoise:nlmeans... (18 threads, 2.586783sec)
denoise:nlmeans... (19 threads, 2.479343sec)
denoise:nlmeans... (20 threads, 2.382732sec)
denoise:nlmeans... (21 threads, 2.282796sec)
denoise:nlmeans... (22 threads, 2.186075sec)
denoise:nlmeans... (23 threads, 2.110913sec)
denoise:nlmeans... (24 threads, 2.029041sec)
Now, this is all entwined in the cores-cache-threads-MHz mix. But the lazy programmer (me, that is…) is just going to prepend a #pragma omp for to each of his for-loops and let OpenMP figure out how to lay the work upon the computer presented to it. I just thought I’d get a simple sense of scaling and present it here for consideration…