Processing Cores and Scaling

I recently built a new desktop computer with an AMD Ryzen 9 3900X 12-core processor and 16GB RAM. It wasn’t that I was dissatisfied with the 4-core Phenom II, but I wanted to mess a bit with processing scaling, e.g., how do the returns diminish as cores are thrown to the task. So, after a harrowing experience with thermal paste and a mis-seated memory module I now have a blisteringly-fast desktop.

Building a scaling test turned out to be rather simple. When I incorporated OpenMP multithreading in rawproc and img, I defaulted all the image processing operations to parallelize with the maximum number of available threads, and I included logging and messaging to report that. So, I wrote a simple img processing chain to take a raw file to linear TIFF and wrapped it in a for loop that did an export OMP_NUM_THREADS=$i before each img invocation; about 15 minutes of coding. To put significant processing in the mix, I grepped the output for the denoise message, where the nlmeans algorithm is selected, and then ran that script with three raws: 1) a 16MP image from my D7000, 2) a 24MP image from my Z 6, and 3) a 47MP D850 raw downloaded from DPReview. In that order, here’s the output:

glenn@bena:~/ImageStuff/corescaling$ ./proctest.sh DSG_3111.NEF 
denoise:nlmeans...  (2 threads, 16.585763sec)
denoise:nlmeans...  (4 threads, 8.473911sec)
denoise:nlmeans...  (6 threads, 5.770318sec)
denoise:nlmeans...  (8 threads, 4.381469sec)
denoise:nlmeans...  (10 threads, 3.549655sec)
denoise:nlmeans...  (12 threads, 3.009535sec)
denoise:nlmeans...  (14 threads, 3.170146sec)
denoise:nlmeans...  (16 threads, 2.775785sec)
denoise:nlmeans...  (18 threads, 3.286111sec)
denoise:nlmeans...  (20 threads, 2.379996sec)
denoise:nlmeans...  (22 threads, 2.217976sec)
denoise:nlmeans...  (24 threads, 2.019165sec)
glenn@bena:~/ImageStuff/corescaling$ ./proctest.sh DSZ_4168.NEF 
denoise:nlmeans...  (2 threads, 24.825491sec)
denoise:nlmeans...  (4 threads, 12.896952sec)
denoise:nlmeans...  (6 threads, 8.788920sec)
denoise:nlmeans...  (8 threads, 6.591874sec)
denoise:nlmeans...  (10 threads, 5.367246sec)
denoise:nlmeans...  (12 threads, 4.556160sec)
denoise:nlmeans...  (14 threads, 4.809256sec)
denoise:nlmeans...  (16 threads, 4.278626sec)
denoise:nlmeans...  (18 threads, 3.939623sec)
denoise:nlmeans...  (20 threads, 3.554366sec)
denoise:nlmeans...  (22 threads, 3.314698sec)
denoise:nlmeans...  (24 threads, 3.036885sec)
glenn@bena:~/ImageStuff/corescaling$ ./proctest.sh DSC_0312.NEF 
denoise:nlmeans...  (2 threads, 46.687434sec)
denoise:nlmeans...  (4 threads, 23.874114sec)
denoise:nlmeans...  (6 threads, 16.110012sec)
denoise:nlmeans...  (8 threads, 12.475774sec)
denoise:nlmeans...  (10 threads, 10.004887sec)
denoise:nlmeans...  (12 threads, 8.359038sec)
denoise:nlmeans...  (14 threads, 8.882417sec)
denoise:nlmeans...  (16 threads, 7.795642sec)
denoise:nlmeans...  (18 threads, 7.205934sec)
denoise:nlmeans...  (20 threads, 6.613540sec)
denoise:nlmeans...  (22 threads, 6.146585sec)
denoise:nlmeans...  (24 threads, 5.675327sec)
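
For reference, the loop described above can be sketched roughly as follows. This is Python rather than the shell script actually used, and since the img command line isn't shown in this post, the sketch takes the command as a parameter and demonstrates with a stand-in command that only echoes OMP_NUM_THREADS. The names run_with_threads and stand_in are illustrative, not part of rawproc or img.

```python
import os
import subprocess
import sys

def run_with_threads(cmd, thread_counts, marker="denoise:nlmeans"):
    """Run cmd once per thread count; return the output lines containing marker."""
    matches = []
    for n in thread_counts:
        # Same effect as `export OMP_NUM_THREADS=$i` before each invocation.
        env = dict(os.environ, OMP_NUM_THREADS=str(n))
        proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
        # Equivalent of grepping the program's messages for the denoise line.
        for line in (proc.stdout + proc.stderr).splitlines():
            if marker in line:
                matches.append(line)
    return matches

# Stand-in for the real img invocation (hypothetical): it just reports the
# thread count it was given through the environment.
stand_in = [
    sys.executable, "-c",
    "import os; print('denoise:nlmeans...  (%s threads)' % os.environ['OMP_NUM_THREADS'])",
]

for line in run_with_threads(stand_in, (2, 4, 6)):
    print(line)
```

To use it for real, replace stand_in with the actual img command line as a list of arguments and pass range(2, 25, 2) for the thread counts.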

Of note: the nlmeans algorithm I implemented is a code conversion from a G’MIC implementation presented by @David_Tschumperle in his blog. It’s quite inefficient, but I’ve been too lazy to figure out one of the published optimizations.

I haven’t done any quantitative analysis on this, just some eyeball consideration. The main observation I think is pertinent is that image size matters when selecting a suitable processor. If one stays in the 24MP range, a 6- or 8-core CPU might be quite enough; if one has or aspires to a 40+MP camera, more is probably better.
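
As a quick sanity check on the image-size point, here is a small Python sketch that divides the logged nlmeans times by the nominal megapixel counts given above. The megapixel figures are the rounded ones from this post, not exact sensor dimensions, so treat the result as approximate.

```python
# (nominal megapixels, seconds at 2 threads, seconds at 24 threads),
# taken from the three logs above.
images = {
    "D7000": (16, 16.585763, 2.019165),
    "Z 6": (24, 24.825491, 3.036885),
    "D850": (47, 46.687434, 5.675327),
}

def seconds_per_mp(seconds, megapixels):
    """Processing time normalized by image size."""
    return seconds / megapixels

for name, (mp, t2, t24) in images.items():
    print(f"{name}: {seconds_per_mp(t2, mp):.2f} s/MP at 2 threads, "
          f"{seconds_per_mp(t24, mp):.3f} s/MP at 24 threads")
```

The per-megapixel cost comes out close to constant across the three cameras (roughly one second per megapixel at 2 threads), so the time is driven almost entirely by pixel count.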

Surprising to me was the incremental speedup between threads, which are a logical construct the CPU presents on top of its physical cores. My CPU has 12 cores and 24 threads, which means that in a 24-thread parallel campaign each physical core is running two threads. I really expected to see a stair-step pattern in the speedup curve, but there was significant speedup at nearly every increment:

glenn@bena:~/ImageStuff/corescaling$ ./proctest.sh DSG_3111.NEF 
denoise:nlmeans...  (2 threads, 16.505298sec)
denoise:nlmeans...  (3 threads, 11.212582sec)
denoise:nlmeans...  (4 threads, 8.533492sec)
denoise:nlmeans...  (5 threads, 6.808916sec)
denoise:nlmeans...  (6 threads, 5.801307sec)
denoise:nlmeans...  (7 threads, 4.961288sec)
denoise:nlmeans...  (8 threads, 4.428086sec)
denoise:nlmeans...  (9 threads, 3.971619sec)
denoise:nlmeans...  (10 threads, 3.571562sec)
denoise:nlmeans...  (11 threads, 3.280610sec)
denoise:nlmeans...  (12 threads, 3.000414sec)
denoise:nlmeans...  (13 threads, 3.415608sec)
denoise:nlmeans...  (14 threads, 3.167845sec)
denoise:nlmeans...  (15 threads, 2.993138sec)
denoise:nlmeans...  (16 threads, 2.784979sec)
denoise:nlmeans...  (17 threads, 2.677216sec)
denoise:nlmeans...  (18 threads, 2.586783sec)
denoise:nlmeans...  (19 threads, 2.479343sec)
denoise:nlmeans...  (20 threads, 2.382732sec)
denoise:nlmeans...  (21 threads, 2.282796sec)
denoise:nlmeans...  (22 threads, 2.186075sec)
denoise:nlmeans...  (23 threads, 2.110913sec)
denoise:nlmeans...  (24 threads, 2.029041sec)
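
To check the log for steps that went the wrong way, a few lines of Python are enough. This sketch parses the img message format shown above with a regular expression and lists the thread counts that ran slower than the count before them. Only the lines around the 12-core boundary are included here, and the helper names parse and slowdowns are made up for the example.

```python
import re

LOG = """\
denoise:nlmeans...  (10 threads, 3.571562sec)
denoise:nlmeans...  (11 threads, 3.280610sec)
denoise:nlmeans...  (12 threads, 3.000414sec)
denoise:nlmeans...  (13 threads, 3.415608sec)
denoise:nlmeans...  (14 threads, 3.167845sec)
denoise:nlmeans...  (15 threads, 2.993138sec)
denoise:nlmeans...  (16 threads, 2.784979sec)
"""

LINE = re.compile(r"\((\d+) threads, ([\d.]+)sec\)")

def parse(log):
    """Return (threads, seconds) pairs found in the log text."""
    return [(int(m.group(1)), float(m.group(2))) for m in LINE.finditer(log)]

def slowdowns(runs):
    """Thread counts whose time is worse than the previous entry's."""
    return [n for (_, prev), (n, cur) in zip(runs, runs[1:]) if cur > prev]

runs = parse(LOG)
t12 = dict(runs)[12]
print(slowdowns(runs))                                  # [13]
print(min(n for n, t in runs if n > 12 and t < t12))    # 15
```

So the only backward step is at 13 threads, the first count that puts two threads on one physical core, and the 12-thread time is not beaten again until 15 threads.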

Now, this is all entwined in the cores-cache-thread-MHz mix. But the lazy programmer (me, that is…) is just going to prepend all his for-loops with #pragma omp fors and let OpenMP figure out how to lay the work upon the computer presented to it. I just thought I’d get a simple sense of scaling, and present it here for consideration…


Observation: You actually slowed down slightly when you transitioned from 1 thread per core to just over 1. But you made it up by the time you hit 15 threads.

Going from 2 threads to 12 (6x as many threads) was about 5.5x faster

Going from 12 to 24 (2x as many threads) was 1.5x faster - I’m actually kind of surprised that you got that much improvement from multithreading. (I have never looked much at nlmeans - possibly branch-heavy/prone to pipeline stalls?)
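
Those ratios can be rechecked from the logged times. A minimal Python sketch, using the 2-, 12- and 24-thread times from the one-thread-at-a-time DSG_3111.NEF run above; the "of ideal" figure is simply the speedup divided by the factor by which the thread count grew.

```python
def speedup(t_before, t_after):
    """How many times faster the second run was."""
    return t_before / t_after

# Seconds for the nlmeans step, keyed by thread count.
times = {2: 16.505298, 12: 3.000414, 24: 2.029041}

for lo, hi in [(2, 12), (12, 24)]:
    s = speedup(times[lo], times[hi])
    efficiency = s / (hi / lo)   # 1.0 would be perfect linear scaling
    print(f"{lo} -> {hi} threads: {s:.2f}x faster, {efficiency:.0%} of ideal")

# 2 -> 12 threads: 5.50x faster, 92% of ideal
# 12 -> 24 threads: 1.48x faster, 74% of ideal
```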

Now try changing your RAM speed. 3600 or 3733 “MHz” are best for Ryzen 3000.

As far as I know, latency isn’t hugely important for image processing.

I bought DDR4-3200 RAM, the cheapest…

I was wondering more about the “distance” through the cache. One day, I’m going to instrument a rawproc tool to deliver per-thread timing, and see if there’s a correlatable difference to cache hits…

Yes, this was just a single run of each, no telling what the other processes on the computer were doing. I was going to clean it off, but decided, who does that when they post-process?

nlmeans in its original incarnation is a loop within a loop, within the two loops that walk the width and height of the image.
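
To show the shape of that nesting, here is a minimal Python sketch of a naive non-local means on a small grayscale image. It is not the rawproc code or the G’MIC original: the function name, the search, patch and h parameters, the border clamping, and the exponential weighting are all assumptions chosen for illustration. The patch comparison adds two more small inner loops in this version.

```python
import math

def nlmeans_naive(img, search=2, patch=1, h=0.1):
    """Naive non-local means on a 2-D list of floats (illustrative only)."""
    height, width = len(img), len(img[0])

    def px(y, x):
        # Clamp coordinates so patches near the border stay inside the image.
        return img[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]

    patch_area = (2 * patch + 1) ** 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):                            # walk the image rows
        for x in range(width):                         # walk the image columns
            weight_sum = 0.0
            value_sum = 0.0
            for dy in range(-search, search + 1):      # search window rows
                for dx in range(-search, search + 1):  # search window columns
                    # Mean squared difference between the patch around (y, x)
                    # and the patch around the candidate pixel (y+dy, x+dx).
                    dist = 0.0
                    for py in range(-patch, patch + 1):
                        for qx in range(-patch, patch + 1):
                            d = px(y + py, x + qx) - px(y + dy + py, x + dx + qx)
                            dist += d * d
                    dist /= patch_area
                    w = math.exp(-dist / (h * h))      # similar patches weigh more
                    weight_sum += w
                    value_sum += w * px(y + dy, x + dx)
            out[y][x] = value_sum / weight_sum         # weight_sum >= 1 (self match)
    return out

flat = [[0.5] * 4 for _ in range(4)]
print(nlmeans_naive(flat)[0][0])   # a flat image comes back unchanged
```

Every output pixel is computed independently of the others, which is why parallelizing just the outermost loop spreads the work across threads so easily, and also why the cost grows directly with pixel count.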

… slightly off topic, but if you’re worried about nlmeans speed and cache performance, this is how it’s done: http://www.iacl.ece.jhu.edu/proceedings/isbi2008/pdfs/0001331.pdf . this is what made its way into darktable/vkdt, too. their table 2 looks much like yours above. does not quite scale linearly with the number of cores (but works on GPU). also they got pictures of what looks like coronaviruses, quite a visionary paper :)


Thanks a bunch; I’m trying to tie up rawproc 1.0, so I’m going to give it a go for 1.1…

Oh, something I forgot to capture in the original post: I did the analysis with an img executable that I’d copied over from the old four-core Athlon machine; with all the magic of OpenMP, img just found the 12 cores of the Ryzen machine and used them. No configuration changes. Probably won’t ever be as fast as a well-oiled GPU pipeline, but there’s something to be said for “works out-of-the-box”…