So, I’ve been gradually adding things to my homebrew image editor. One of my objectives has been to make it usable on my cheap Windows tablet, and some of the more intensive operations were bringing it to its knees. Many of the other image applications make a feature of using GPUs to parallelize their operations, but nothing like that is available on the cheap tablet. It does have four real cores (Intel Atom Z3735D, 1.33 GHz clock), though, so I started playing with threading to spread the work across the available cores.
Cross-platform support was important, so I started with pthreads, intending to incorporate one of the win-pthread implementations. I got that working in a command-line application, but it didn’t play well with my wxWidgets GUI program. Switching to wxWidgets’ wxThread class solved the integration problems and also provided a nicer class-based interface.
But the reason I’m describing all this here is to discuss the work allocation across the image. My first implementation clones a copy of the image, then starts a number of threads equal to the number of cores available. The first thread gets starting pixel row 0 and steps through the image by numberofcores rows at a time (the stride, I suppose); the next thread starts at row 1, the one after that at row 2, and so on, all incrementing by the same numberofcores value until they reach the image height. Processing each pixel involves reading it from the source image, doing the work, and writing the resulting color values to the same pixel coordinates in the destination image. I saw no race conditions in that arrangement, so I didn’t use any of the synchronization tools, and I haven’t gummed up anything so far, over probably a couple hundred images processed to date.
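To make that concrete, here’s a simplified sketch of one worker (the wxThread parts are the real wxWidgets API; the Image class and FilterRow() are stand-ins for my actual code):

    // One worker for the interleaved allocation: thread k handles rows
    // k, k + numberofcores, k + 2*numberofcores, ... up to the image height.
    class RowWorker : public wxThread
    {
    public:
        RowWorker(const Image& src, Image& dst, int firstRow, int stride)
            : wxThread(wxTHREAD_JOINABLE),
              m_src(src), m_dst(dst), m_firstRow(firstRow), m_stride(stride) {}

    protected:
        ExitCode Entry() override
        {
            for (int row = m_firstRow; row < m_src.Height(); row += m_stride)
                FilterRow(m_src, m_dst, row);   // read src, write same coords in dst
            return (ExitCode)0;
        }

    private:
        const Image& m_src;
        Image&       m_dst;
        int          m_firstRow;
        int          m_stride;
    };

The main thread creates one of these per core (firstRow running 0 through cores-1, stride equal to the core count), starts them with Create()/Run(), and Wait()s on each one before using the destination image.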
The speedup I’m getting is roughly linear; here are two representative lines from my log file, for a 3x3 convolution sharpen:
tool=sharpen,imagesize=4948x3280,imagebpp=24,threads=1,time=12.561099sec
tool=sharpen,imagesize=4948x3280,imagebpp=24,threads=4,time=3.681385sec
That’s a full-sized jpeg from my D7000; it goes to ludicrous speed with my web-resized image:
tool=sharpen,imagesize=640x424,imagebpp=24,threads=1,time=0.232822sec
tool=sharpen,imagesize=640x424,imagebpp=24,threads=4,time=0.089126sec
Note the smaller image isn’t as close to linear in speedup (about 2.6x with 4 threads); I credit that to the overhead of setting up the data.
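For context, the per-pixel work in the sharpen op is a 3x3 convolution, conceptually along these lines (just a sketch: the kernel weights shown are the usual Laplacian-style sharpen, GetPixel/SetPixel/Clamp are placeholders, and edge pixels need their own handling):

    // Illustrative 3x3 sharpen kernel and per-pixel convolution.
    static const int kKernel[3][3] = { {  0, -1,  0 },
                                       { -1,  5, -1 },
                                       {  0, -1,  0 } };

    void SharpenPixel(const Image& src, Image& dst, int x, int y)
    {
        int sum[3] = { 0, 0, 0 };                      // R, G, B accumulators
        for (int ky = -1; ky <= 1; ++ky)
            for (int kx = -1; kx <= 1; ++kx)
            {
                RGB p = GetPixel(src, x + kx, y + ky); // assumes an interior pixel
                int w = kKernel[ky + 1][kx + 1];
                sum[0] += w * p.r;
                sum[1] += w * p.g;
                sum[2] += w * p.b;
            }
        SetPixel(dst, x, y, Clamp(sum[0]), Clamp(sum[1]), Clamp(sum[2]));  // clamp each to 0..255
    }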
With that success, I began to wonder whether I was thrashing the memory caches with my interleaved work allocation, so I wrote a version that hands each thread a contiguous chunk of rows instead, and I got almost identical results. I don’t understand enough about cache-aware programming to take it further yet.
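The chunked version only changes how the rows are handed out; roughly like this, with the last thread picking up any leftover rows (ChunkWorker here is just the obvious variant of the worker above, processing rows first through last-1):

    // Contiguous-chunk allocation: thread k gets rows [k*chunk, k*chunk + chunk)
    // instead of every numberofcores-th row; the last thread also takes the remainder.
    int height   = src.Height();
    int nthreads = wxThread::GetCPUCount();      // cores reported by wxWidgets
    if (nthreads < 1) nthreads = 1;              // GetCPUCount() can return -1
    int chunk    = height / nthreads;

    for (int k = 0; k < nthreads; ++k)
    {
        int first = k * chunk;
        int last  = (k == nthreads - 1) ? height : first + chunk;
        workers.push_back(new ChunkWorker(src, dst, first, last));
    }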
I also did significant code optimization: inlining code, using pointers and indexes to reference pixel locations, and so on, which bought a couple of seconds by itself.
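By “pointers and indexes” I mean walking each scanline with a raw pointer instead of going through a per-pixel accessor for every read and write. If the pixels live in a FreeImage bitmap, it looks roughly like this for 24-bpp data (FreeImage_GetScanLine is the real FreeImage call; the B/G/R byte order shown is what FreeImage uses on x86, so take the channel order as illustrative):

    // Walk one 24-bpp scanline with pointers rather than per-pixel calls.
    BYTE* srcLine = FreeImage_GetScanLine(srcDib, row);
    BYTE* dstLine = FreeImage_GetScanLine(dstDib, row);

    for (int x = 0; x < width; ++x)
    {
        int i = x * 3;                 // byte offset of this pixel within the line
        BYTE b = srcLine[i + 0];
        BYTE g = srcLine[i + 1];
        BYTE r = srcLine[i + 2];
        // ... per-pixel work on r, g, b goes here ...
        dstLine[i + 0] = b;
        dstLine[i + 1] = g;
        dstLine[i + 2] = r;
    }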
Anyway, I got so excited by my results that I went and multi-threaded just about everything: all the image ops (except for resize and crop, where I’m still using the FreeImage routines), histogram construction, and image display. Well, I’m still working through the image ops; grayscale and gamma are left to go.
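For the histogram, one way to keep it lock-free is to give each thread its own set of bins over its rows and sum the partial counts once everything has joined; a sketch (the bin indexing via a Luminance() placeholder is illustrative):

    #include <array>
    #include <vector>

    // Each thread fills its own 256-bin histogram over its rows, so there are
    // no shared writes; the main thread sums the partials after all threads join.
    std::vector<std::array<long, 256>> partial(nthreads);   // one zeroed histogram per thread

    // Inside thread k, for every pixel in its rows:
    //     partial[k][Luminance(r, g, b)]++;

    std::array<long, 256> histogram = {};
    for (int k = 0; k < nthreads; ++k)
        for (int bin = 0; bin < 256; ++bin)
            histogram[bin] += partial[k][bin];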
And so, to all you programmers, I’d like to pose the following two questions:
- Should I worry at all about image read/write synchronization? Reading from an image that isn’t being modified shouldn’t be a problem, and the threads are each working on different rows of a separate destination image, so I can’t fathom a problem there either.
- What would be the cache access concerns associated with an image loaded into contiguous memory?
Sorry for the long post, and I apologize to all the non-programmer folks. I just didn’t know where else to bring these questions.