Multi-Threaded Image Processing

So, I’ve been gradually adding things to my homebrew image editor. One of my objectives has been to make it usable on my cheap Windows tablet, and some of the more intense operations were bringing it to its knees. Many of the other image applications make a feature of using GPUs to parallelize their operations, but there’s no such thing available on the cheap tablet. It does, however, have four real cores (Intel Atom Z3735D, 1.33 GHz clock), so I started playing with threading to spread the work across the available cores.

Cross-platform support was important, so I started with pthreads, with the intent to incorporate one of the win-pthread implementations. I got that to work in a command-line application, but it didn’t respond well in my wxWidgets GUI program. Switching to wxWidgets’ wxThread class solved the integration problems as well as providing a nicer class-based interface.

But the reason I’m describing all this here is to discuss the work allocation across the image. My first implementation was to clone a copy of the image, then start a number of threads equal to the number of cores available. The first thread would get the starting pixel row (stride?) of 0 and would increment through every numberofcores-th row; the next thread starts at row 1, the one after that at row 2, etc., all incrementing by the same numberofcores value, and they’d all go until they reached the image height. The processing of each pixel involves reading it from the source image, doing the work, and writing the resulting color values to the same pixel coordinates in the destination image. I saw no race conditions in the above processing, so I didn’t use any of the synchronization tools, and I haven’t gummed up anything so far, through probably a couple of hundred images to date.
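A minimal sketch of that allocation (using std::thread here rather than the thread classes I actually use; the LUT operation and all names are just for illustration):

#include <thread>
#include <vector>

// Each thread handles rows startrow, startrow+stride, startrow+2*stride, ...
// applying a lookup table to an 8-bit grayscale image.
void applyLUTRows(const unsigned char *src, unsigned char *dst,
                  unsigned w, unsigned h, const unsigned char *lut,
                  unsigned startrow, unsigned stride)
{
    for (unsigned y = startrow; y < h; y += stride)
        for (unsigned x = 0; x < w; x++)
            dst[y*w + x] = lut[src[y*w + x]];
}

void applyLUT(const unsigned char *src, unsigned char *dst,
              unsigned w, unsigned h, const unsigned char *lut)
{
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;                    // the query can fail; fall back to one thread
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; t++)      // one thread per core, interleaved rows
        pool.emplace_back(applyLUTRows, src, dst, w, h, lut, t, n);
    for (auto &th : pool)                 // wait for all threads to finish
        th.join();
}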

The speedup I’m getting is roughly linear; here are two representative lines from my log file, 3x3 convolution sharpen:

tool=sharpen,imagesize=4948x3280,imagebpp=24,threads=1,time=12.561099sec
tool=sharpen,imagesize=4948x3280,imagebpp=24,threads=4,time=3.681385sec

That’s a full-sized jpeg from my D7000; it goes to ludicrous speed with my web-resized image:

tool=sharpen,imagesize=640x424,imagebpp=24,threads=1,time=0.232822sec
tool=sharpen,imagesize=640x424,imagebpp=24,threads=4,time=0.089126sec

Note the smaller images aren’t so linear in speedup; I credit that to the overhead of setting up the threads and data, which is fixed and so looms larger when the per-pixel work is small.

With that success, I began to wonder if I wasn’t thrashing the memory caches with my interleaved work allocation, so I wrote a version that allocated the image in contiguous chunks, and I got almost identical results. I don’t understand enough about cache-aware programming to go further, yet.
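For the record, the chunked version computes contiguous row bounds per thread instead of interleaving, along these lines (a sketch; ceiling division so the last thread absorbs any remainder):

#include <algorithm>

// Thread t of n covers rows [y0, y1) of an h-row image.
void chunkBounds(unsigned t, unsigned n, unsigned h,
                 unsigned &y0, unsigned &y1)
{
    unsigned per = (h + n - 1) / n;   // ceiling division
    y0 = std::min(t * per, h);
    y1 = std::min(y0 + per, h);
}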

I also did significant code optimization: in-lining code, using pointers and indexes to reference pixel locations, etc., which bought a couple of seconds by itself.

Anyway, I got so excited with my results, I went and multi-threaded just about everything: all the ops (’cept for resize and crop, where I’m still using the FreeImage routines), histogram construction, and image display. Well, I’m still working on the image ops; grayscale and gamma are left to go. And so, to all you programmers, I’d like to pose the following two questions:

  1. Should I worry any about image read/write synchronization? Reading from an image not being modified shouldn’t be a problem, and the threads are each working on different rows of a separate destination image, so I can’t fathom a problem there.

  2. What would be the cache access concerns associated with an image loaded into contiguous memory?

Sorry for the long post, and I apologize to all the non-programmer folks. I just didn’t know where else to bring these questions.


You’re using actual threads?

Why not just use OpenMP? That lets you just drop a #pragma or two in front of a for loop and have it be parallelized neatly.

All you do is declare as ‘shared’ anything that’s not const or instantiated inside the loop (assuring it that it’s okay to touch these even though other loop iterations may also be touching them at the same time), and then tell it to perform the for loop in parallel.
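A short example (a minimal sketch, not from any particular project; a LUT applied to a 24bpp image, with placeholder names):

// Compile with -fopenmp; without it the pragma is silently ignored.
void applyLUT_omp(const unsigned char *src, unsigned char *dst,
                  unsigned w, unsigned h, unsigned pitch,
                  const unsigned char *lut)
{
    #pragma omp parallel for shared(src, dst, lut, w, h, pitch)
    for (int y = 0; y < (int)h; y++) {
        const unsigned char *srow = src + y * pitch;  // declared inside the loop: private per thread
        unsigned char *drow = dst + y * pitch;
        for (unsigned x = 0; x < 3*w; x++)
            drow[x] = lut[srow[x]];
    }
}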


As for caching concerns, each individual row in an image is typically dozens of cache lines wide, and modern processors can prefetch them, so there’s no real benefit for doing large chunks of work versus individual rows.
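To put a number on that: at 24bpp, a 4948-pixel row is 4948 × 3 = 14,844 bytes, which spans over 230 of the usual 64-byte cache lines.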


Also, you mentioned “real cores” as opposed to, I presume, hyperthreading? Depending on the operation, hyperthreading can give up to a full 2x speedup. Relative to the develop step (point-wise, lots of multiplies and additions), the blur step in Filmulation (a 2D 4-tap IIR convolution) is twice as fast on my hyperthreaded i7 laptop as on my i5 desktop. Either the develop step is too complicated a series of instructions for it to run twice at once, or each thread fully utilizes all of the ALU resources and there’s no benefit from extra threads? I’m not exactly sure of the mechanism of hyperthreading nowadays, what’s duplicated and what’s not.

Probably familiarity more than anything. I’ve done a fair bit of thread programming, and I’ve worked a few failure investigations where the root cause turned out to be “creative” use of threads - poorly organized producer-consumer schemes and such.

There was also the desire to strip out any unnecessary processing - function and procedure calls are insidious in that regard. Using the lowest-level library available to manage threading made sense for the same reason; I’d prefer to be using pthreads or native Windows threads, but wxWidgets convinced me to ease up on that. I also did really simple things like splitting the pixel location calculation so the x part was done in the x loop and the y part in the y loop, and that made about two-tenths of a second’s difference in the convolve operation on the full-sized image. I gave a couple of seconds of thought to in-line assembler, but talked myself out of that real quick. It became a challenge to find every place to squeeze a few cycles out of the processing.
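The split looks something like this (a sketch, not the actual rawproc code; a simple invert of a 24bpp buffer):

// The y part (y*pitch) is computed once per row instead of once per pixel.
void invert24(unsigned char *bits, unsigned w, unsigned h, unsigned pitch)
{
    for (unsigned y = 0; y < h; y++) {
        unsigned char *row = bits + y * pitch;   // y part, in the y loop
        for (unsigned x = 0; x < w; x++) {
            unsigned char *px = row + 3 * x;     // x part, in the x loop
            px[0] = 255 - px[0];                 // invert R, G, B
            px[1] = 255 - px[1];
            px[2] = 255 - px[2];
        }
    }
}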

WRT hyperthreading, yep, the Atom processor I’m using has four real cores, no hyperthread mechanisms. My desktop computer is a quad-core AMD Phenom, no hyperthreading there. I simply ask the library for the number of available cores, which on hyperthreaded Intel systems will be the number of logical cores (hyperthreads). My image operations are mostly LUTs, not much math going on there.

I’ve done a few runs with thread counts higher than the available cores, but I didn’t see the speedup fall off immediately past 4 - not until 6 or 7. Need to do a more organized test.

As for caching concerns, each individual row in an image is typically dozens of cache lines wide, and modern processors can prefetch them, so there’s no real benefit for doing large chunks of work versus individual rows.

That’s one of the things I suspected, and might be why my speedup on the smaller images is less.

On NUMA systems using larger chunks per thread might be preferable.

As a general rule, just try to access contiguous memory as much as possible. Each core will have its own L1 cache so you shouldn’t see many issues with threads stomping on each other. In fact, if you are doing stencil operations (where each result value depends on values “around it” in a grid), then having the threads interleaved may actually help you since they collectively prefill the L3 cache with the same data and all use it together.

There are a lot of smaller optimizations you could do. Blocked access instead of linear would typically help with some types of operations; partial loop unrolling and instruction reordering can also help a lot sometimes. But that sort of thing really normally isn’t worth it: your code becomes convoluted and difficult to maintain, you really have to do it for a specific target architecture, and you’re aiming for portability.

One thing that can give you a substantial boost, though, is SSE instructions and the like - that is, using vector instructions to process 8 pixels at a time on each core. Yes, it also uglifies your code and needs to be written with specific CPUs in mind, but it can have a really big effect for some workloads.
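For a flavor of what that looks like (a sketch assuming SSE2 and a plain 8-bit buffer; the function and names are mine, not from any library):

#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdint>
#include <cstddef>

// Brighten an 8-bit buffer 16 pixels per iteration, using saturating
// adds so values clamp at 255 instead of wrapping around.
void brighten_sse2(uint8_t *buf, size_t n, uint8_t amount)
{
    const __m128i delta = _mm_set1_epi8((char)amount);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i px = _mm_loadu_si128((const __m128i *)(buf + i));
        px = _mm_adds_epu8(px, delta);              // saturating unsigned add
        _mm_storeu_si128((__m128i *)(buf + i), px);
    }
    for (; i < n; ++i) {                            // scalar tail
        unsigned v = buf[i] + amount;
        buf[i] = v > 255 ? 255 : (uint8_t)v;
    }
}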

The Grand Experiment

CarVac, you piqued my interest in OpenMP, so I read up on it over lunch, and figured out a shootout test that wouldn’t take too long to construct. Believe me, I’m all about easy, and they’ve definitely made OpenMP easy.

Firstly, I had to make sure my compiler had the libraries. It’s GCC 4.7, but it was missing the libgomp stuff, boo. However, it was trivial to find it, download it, and install it over my MinGW GCC compiler. I wrote a test hello-world program, compiled it, and got four Hello!s, one per core. Yay!
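The test program was along these lines (a reconstructed sketch, not my exact code; build with g++ -fopenmp hello.cpp):

#include <cstdio>
#include <omp.h>

int main()
{
    // Each thread in the parallel region prints once: expect one Hello! per core.
    #pragma omp parallel
    printf("Hello! from thread %d\n", omp_get_thread_num());
    return 0;
}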

My threading is implemented with the wxWidgets wxThread class, so I have ThreadedCurve, ThreadedSaturation, ThreadedHistogram, etc. classes. Over the weekend, I wrote a static ApplyCurve method in ThreadedCurve, because I was using it in many tools and I didn’t want to put the thread spawn loops in each tool. That helped me with this test; I just took ApplyCurve and implemented ApplyCurveOMP, another static method into which I copied the for statement that loops over the image and added the ‘#pragma omp for’ - that in itself saved tens of lines of code to set up the stride interleaving.

I then went to my curve tool and implemented a configuration-file-based if to select either the old ApplyCurve or the new ApplyCurveOMP, added some logging, and did a few curve applications on an image. Here are the results:

Thu May 05 20:33:43 2016 - tool=curve(openmp),imagesize=4948x3280,imagebpp=24,threads=4,time=0.706570sec
Thu May 05 20:33:45 2016 - tool=curve(openmp),imagesize=4948x3280,imagebpp=24,threads=4,time=0.659999sec
Thu May 05 20:33:48 2016 - tool=curve(openmp),imagesize=4948x3280,imagebpp=24,threads=4,time=0.682357sec
Thu May 05 20:33:51 2016 - tool=curve(openmp),imagesize=4948x3280,imagebpp=24,threads=4,time=0.687961sec
Thu May 05 20:33:53 2016 - tool=curve(openmp),imagesize=4948x3280,imagebpp=24,threads=4,time=0.671852sec
Thu May 05 20:33:55 2016 - tool=curve(openmp),imagesize=4948x3280,imagebpp=24,threads=4,time=0.742699sec
Thu May 05 20:33:59 2016 - tool=curve(openmp),imagesize=4948x3280,imagebpp=24,threads=4,time=0.741728sec
Thu May 05 20:34:22 2016 - tool=curve(wxthread),imagesize=4948x3280,imagebpp=24,threads=4,time=0.288799sec
Thu May 05 20:34:24 2016 - tool=curve(wxthread),imagesize=4948x3280,imagebpp=24,threads=4,time=0.268799sec
Thu May 05 20:34:27 2016 - tool=curve(wxthread),imagesize=4948x3280,imagebpp=24,threads=4,time=0.267447sec
Thu May 05 20:34:29 2016 - tool=curve(wxthread),imagesize=4948x3280,imagebpp=24,threads=4,time=0.268733sec
Thu May 05 20:34:31 2016 - tool=curve(wxthread),imagesize=4948x3280,imagebpp=24,threads=4,time=0.266260sec
Thu May 05 20:34:33 2016 - tool=curve(wxthread),imagesize=4948x3280,imagebpp=24,threads=4,time=0.265661sec
Thu May 05 20:34:38 2016 - tool=curve(wxthread),imagesize=4948x3280,imagebpp=24,threads=4,time=0.269596sec
Thu May 05 20:34:42 2016 - tool=curve(wxthread),imagesize=4948x3280,imagebpp=24,threads=4,time=0.265873sec
Thu May 05 20:34:44 2016 - tool=curve(wxthread),imagesize=4948x3280,imagebpp=24,threads=4,time=0.266198sec

So, OpenMP comes in a little more than twice as slow. I was expecting it to be slower, but not that much; it would have seemed to me the extra overhead of OpenMP setting up the threading would have been small compared to my setup, and the size of the processed image would have dominated the time, but no…

Still, you showed me something very useful. I used to teach distributed systems in college; we did threads the hard way, and tools like OpenMP were still not well integrated. I’ve not kept up with the tech, so this was a pleasant surprise; I didn’t even have to #include a library.

I’m already past the complexity barrier, so I’m going to keep my threaded implementation. However, for anyone who’s still single-threaded, OpenMP would be a trivially easy speedup: no code changes, just add the pragma before the image loops and add the -fopenmp compiler flag. Like a certain lawyer’s ads in my hometown, “It’s Just That Easy”.


Yes, OpenMP is an awesome API for making parallel implementations simple; I second that. It’s what I use in G’MIC and CImg to parallelize all the algorithms.

Sometimes it isn’t just that easy, though. That may be the reason why you don’t get the expected speedup in your test scenario. Show us the code of your test and we’ll see :wink:

Ingo,

The code is part of the rawproc project on GitHub.

Switch to the openmp branch, then look at ThreadedCurve.h and ThreadedCurve.cpp. This class is a child of wxThread, the wxWidgets thread class. It applies a lookup table (LUT) based on a vector containing a set of control points, converted to LUT values with a spline algorithm. You’ll note there are other Threaded… classes, but this is the only one with a static ApplyCurve() method that kicks off all the threads; for the others, that work is done by the calling tool. ApplyCurve() has two for-loops, one to create and run the ThreadedCurve instances, and a second one to wait for their completion - a classic thread pattern. The first for-loop passes startrow and increment values that interleave the threads amongst the rows.

So, for the experiment, I made another static method, ApplyCurveOMP(). It essentially hard-codes the LUT for loops in the method without using the wxThread-based class. For OpenMP, I placed the #pragma omp on top of the outside for-loop, which loops over the image rows. I’m assuming OpenMP breaks the for increments into N contiguous ranges, N corresponding to the number of available cores.

There are some differences in where work is done between the two approaches; ApplyCurve() has all the image access calls happening in each thread, where ApplyCurveOMP() does them once and all the threads see the same variables in their scopes. But, that should give the advantage to OpenMP.

Timing is done by the calling routine, using the mark() and duration() routines in util.h/util.cpp. You can see the calls in PicProcessorCurve.cpp::processPic(). For the experiment, I put the two calls in an if-then conditioned on a configuration variable.

If you see something I should be adding to the pragma, or any other thing that would level the playing field, let me know and I’ll rerun. I’m curious myself, because the processing differences I can intuit wouldn’t account for the timing differences measured.

Glenn, you have 2 completely different access patterns here. Let c be the number of threads:

  1. In your threaded implementation, thread x works on rows x, x+c, x+2c, x+3c, and so on…
  2. In the OMP implementation, thread x works roughly on rows x*(h/c) to (x+1)*(h/c) - 1 (h being the image height)

That may explain the differences. Interesting that the access pattern of the threaded implementation is more than twice as fast as that of the OMP implementation. Did you measure the bpp == 48 or the bpp == 24 part?

24bpp.

Earlier, I did a threaded test that used contiguous chunks, mainly to see if that affected cache coherence. There was no discernible difference from the interleaved version - I think for the reason CarVac elucidated: dozens of cache lines per row. I can resurrect the code from its git branch and make an ApplyCurveChunks()…

Next week, I’ll have the evenings to conduct a proper test: combinations of bpp, image size, number of threads, openmp/threading, with run quantities suitable for descriptive statistics. However, it’ll all be on my cheap Windows tablet. If I can squeeze in a few minutes on my desktop this week, I’m at least going to do some AMD 4-core runs to compare with the ones my son did on his 6-core AMD machine. Particularly, I’d like to segregate the setup work from the image work.

Anyway, I think I’ve gotten a good idea of the answer to my second question from this discussion: not much to worry about regarding cache coherence. My first question still lurks (the need for read/write memory protection), but I have hundreds of runs that indicate it’s not needed if the threads reliably work on different parts of the image.

Glenn, I think I found a bug in your omp implementation which makes it slow and also leads to race conditions. By default variables declared outside the parallel region are shared among the threads. You should declare at least x and wdstpix (for bpp == 48) inside the parallel region or declare them as private in the #pragma.
Example:

#pragma omp parallel for
for(unsigned y = 0; y < h; y++) {
    for(unsigned x = 0; x < w; x++) {
        FIRGB16 *wdstpix = (FIRGB16 *) (dstbits + dpitch*y + 6*x);
        .....
or

#pragma omp parallel for private(x,wdstpix)
for(y = 0; y < h; y++) {
    for(x = 0; x < w; x++) {
        wdstpix = (FIRGB16 *) (dstbits + dpitch*y + 6*x);
        .....

So I made the changes you recommended, and there was no change in the time, ~0.8 sec. That obviously didn’t look right, so I thought about it for a bit, then set the number of cores to 1 - still no change. So the pragma wasn’t doing anything at all; an apparent feature of OpenMP is that the pragma is essentially invisible to the compiler if -fopenmp isn’t present. And that, pretty much, was my problem: -fopenmp wasn’t in the compile target. Put it in, and the curve times dropped to within 1/100 sec of the wxThread design.

So, to complete the inquisition, I backed out the variable localization and did a few curves; not only did the times go to almost 3 sec, the image was seriously messed up - rampant memory access conflicts. I think the only reason the image was recognizable at all is that I clone a copy of the original image to use as the destination. So my variable scoping was a bug in every sense.

So if I had it to do over, I’d have used OpenMP from the start. I may yet switch; my command-line program uses pthreads, which is not easily pulled into a wxWidgets GUI program, and I’d like to have one library to include in both.

Glenn, if you switch to OpenMP, -Werror=unknown-pragmas is your friend to immediately detect typos like #pragma omp paralel, which will happen from time to time, too :wink:
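A compile line with both flags might look like this (assuming g++/MinGW; file names are placeholders):

g++ -fopenmp -Werror=unknown-pragmas -O2 -o curvetest curvetest.cpp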
