LittleCMS: fast_float is Fast!

I just finished incorporating the fast_float plugin into my LittleCMS integration in rawproc. Until now, I’ve been adding a proof group (resize, sharpen) to the end of my raw processing chains so I wouldn’t have to wait long for the display transform as I dorked around with filmic parameters. Don’t need to do that anymore; even on my slow Windows tablet, the display transform on the full-sized image now takes well under 1 sec.

I had to reorganize my display pipeline; my previous implementation did both the display color transform and the downsize to 8-bit in the cmsDoTransform() call, in a rather ill-informed attempt to speed things up. The pipeline now does the color transform in floating point, and it is thread-apportioned using the cmsDoTransformLineStride() call per Marti’s direction.
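For readers wondering what "thread-apportioned" means here, a minimal sketch of the row bookkeeping follows. It is not rawproc's code: the names RowBand, apportionRows and bandByteOffset are hypothetical, the buffer is assumed to be tightly packed interleaved pixels, and the sketch deliberately stops short of calling LittleCMS.

```cpp
#include <cstddef>
#include <vector>

// One contiguous band of image rows handed to a single thread.
// (Hypothetical type, for illustration only.)
struct RowBand {
    std::size_t firstRow;  // index of the first row in the band
    std::size_t rowCount;  // number of rows in the band
};

// Split `height` rows into `threads` bands whose sizes differ by at most one.
// Assumption: any even split works; rawproc's own arithmetic may differ.
std::vector<RowBand> apportionRows(std::size_t height, std::size_t threads) {
    std::vector<RowBand> bands;
    if (threads == 0) threads = 1;
    const std::size_t base = height / threads;
    const std::size_t extra = height % threads;  // the first `extra` bands get one more row
    std::size_t row = 0;
    for (std::size_t t = 0; t < threads; ++t) {
        const std::size_t count = base + (t < extra ? 1 : 0);
        bands.push_back({row, count});
        row += count;
    }
    return bands;
}

// Byte offset of a band's first pixel, assuming tightly packed rows
// (no padding between lines).
std::size_t bandByteOffset(const RowBand& band, std::size_t width,
                           std::size_t bytesPerPixel) {
    return band.firstRow * width * bytesPerPixel;
}
```

Each thread would then pass its own band to cmsDoTransformLineStride(): the input and output pointers advanced by the band's byte offset, the image width as pixels per line, the band's rowCount as the line count, and the per-line byte strides of the two buffers. The thread index and count would come from OpenMP.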

I’ve read accounts of other software doing the display transform “in-house” and I was about to do the same, but LittleCMS 2.11 with the fast_float plugin did the trick instead. Highly recommended…

(This note is for the devs; others are welcome to ask questions about it, and I’ll do my best to map it to understandable prose.)


That sounds very interesting. Can you give a pointer to the code where you used the fast_float in rawproc?

Certainly: rawproc/gimage.cpp at master · butcherg/rawproc · GitHub

Nothing too obtuse, just need to use OpenMP thread parameters to compute the apportionment.

Edit: Oh, if you find something amiss, let me know.

how much faster is that compared to the // old and less slow ?

Curious result. Processing a 16MP image on a Lenovo Miix 530 tablet, Intel(R) Core™ i3-6100U CPU @ 2.30GHz, 2301 Mhz, 2 Core(s), 4 Logical Processor(s):

fast_float w/threaded cmsDoTransformLineStride():
Tue Oct 20 14:14:07 2020 - display,time=0.260759sec
Tue Oct 20 14:14:11 2020 - display,time=0.340867sec
Tue Oct 20 14:14:21 2020 - display,time=0.399282sec
Tue Oct 20 14:14:24 2020 - display,time=0.371772sec
Tue Oct 20 14:14:29 2020 - display,time=0.330873sec
Tue Oct 20 14:14:32 2020 - display,time=0.352764sec
Tue Oct 20 14:14:32 2020 - display,time=0.329245sec

fast_float w/threaded cmsDoTransform() (// old and less slow):
Tue Oct 20 14:23:05 2020 - display,time=0.281245sec
Tue Oct 20 14:23:08 2020 - display,time=0.342542sec
Tue Oct 20 14:23:32 2020 - display,time=0.336996sec
Tue Oct 20 14:23:36 2020 - display,time=0.318500sec
Tue Oct 20 14:23:38 2020 - display,time=0.316445sec
Tue Oct 20 14:23:39 2020 - display,time=0.312637sec

threaded cmsDoTransform() (// old and less slow):
Tue Oct 20 14:24:45 2020 - display,time=0.798372sec
Tue Oct 20 14:24:49 2020 - display,time=1.973710sec
Tue Oct 20 14:24:56 2020 - display,time=1.886964sec
Tue Oct 20 14:24:58 2020 - display,time=1.940853sec
Tue Oct 20 14:25:03 2020 - display,time=1.797401sec
Tue Oct 20 14:25:10 2020 - display,time=1.623659sec
Tue Oct 20 14:25:12 2020 - display,time=1.680379sec

Using cmsDoTransformLineStride() doesn’t show any benefit over cmsDoTransform() with just a two-core apportionment; I just ordered the parts for an AMD Ryzen 9 3900 build, 12 cores, so we’ll see whether it makes a difference on that…

In all of your three examples, the first measure is the fastest. What causes that?

Though fast_float wins in every case! Does it have some downsides concerning accuracy or is it just a faster implementation?

In each of the three segments, the first display transform was for the initial image open, where an input profile wasn’t yet available. When the default toolchain was applied, the first tool in its chain is a colorspace tool, which assigns the camera profile. After that, there are both input and display profiles available for the display transform.

I’ve been pixel-peeping my test images processed with the fast_float variant, they still look the same. I don’t know what Marti is using, SSE, SSE2, eye of newt, etc… :smiley:

nice. more speed is always good :slight_smile:

i’m wondering, why do you need the full blown power of littlecms for a display transform? as opposed to the input device transform, an ideal output transform should be accurately expressed in a matrix (plus curves), right?
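To make the "matrix (plus curves)" idea concrete, here is a minimal sketch of such a display transform on a single pixel. Assumptions, all mine: the names Vec3, Mat3, applyMatrix, srgbEncode and displayTransform are hypothetical, the display is taken to use the standard sRGB encoding curve, out-of-range values are simply hard-clipped, and the working-to-display matrix is supplied by the caller.

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<float, 3>;
using Mat3 = std::array<Vec3, 3>;  // row-major 3x3 matrix

// Multiply a linear RGB triplet by a 3x3 matrix.
Vec3 applyMatrix(const Mat3& m, const Vec3& rgb) {
    Vec3 out{};
    for (int r = 0; r < 3; ++r)
        out[r] = m[r][0] * rgb[0] + m[r][1] * rgb[1] + m[r][2] * rgb[2];
    return out;
}

// Standard sRGB encoding curve, with a hard clip to [0, 1].
// Assumption: a real display profile may carry a different curve.
float srgbEncode(float v) {
    if (v <= 0.0f) return 0.0f;
    if (v >= 1.0f) return 1.0f;
    return v <= 0.0031308f ? 12.92f * v
                           : 1.055f * std::pow(v, 1.0f / 2.4f) - 0.055f;
}

// Matrix first (working space to display primaries), then the curve.
Vec3 displayTransform(const Mat3& workingToDisplay, const Vec3& linearRgb) {
    const Vec3 v = applyMatrix(workingToDisplay, linearRgb);
    return {srgbEncode(v[0]), srgbEncode(v[1]), srgbEncode(v[2])};
}
```

With an identity matrix this reduces to plain sRGB encoding; with a real matrix it is a colorimetric transform with no gamut mapping beyond the clip.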

i’m colour managing the whole application in a full-window fragment shader as a very last thing to happen. it also performs dithering to hide 8-bit banding. the shader itself is nothing to speak about, both in terms of complexity or runtime (well below 1ms). i’m wondering whether the <1ms → 260ms increase in cost is justified here. are you using any “perceptual intent” etc gamut mapping in this step too?

Does it have some downsides

As far as I can see the main “downside” is the license. The fast path is GPL; default lcms is LGPL.

FYI, Marti’s own benchmarks: Babl throughput comparative

LCMS is MIT, the plugin is GPL3.


When I incorporated color management in rawproc, the lure of a library that did all the grunt math in the right order was appealing. I’ve recently considered moving the display transform “in-house” for speed, but when I got the fast_float plugin working the need dissolved…

Right now, the display transform rendering intent is hard-coded relative-colorimetric because I need to re-insert the logic to use the property-specified values in the new display pipeline code. My display pipeline is now on its third rework to make it fast enough while retaining a high-quality color transform. The LCMS fast_float plugin has dithering data types, but I’m finding if I do the color transform float->float, it’s not needed.

Nice bonus for the plugin, using MIT is usually a bad idea anyway

i see. i’m mostly concerned with float/float, too. only at some point i’ll output the data to my display. and i don’t always have xorg running in 10 bits, hence the dithering (don’t like the banding i’ll get otherwise, especially for synthetic gradients in renders or in very noise free skies in photography).
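As an illustration of dithering on the way down to 8 bits, here is a minimal sketch using ordered (Bayer) dithering. This is a swapped-in technique chosen because it is deterministic and short; it is not taken from the shader discussed in this thread or from the fast_float plugin, and the names kBayer4 and ditherTo8Bit are hypothetical.

```cpp
#include <cmath>
#include <cstdint>

// Standard 4x4 Bayer threshold matrix, values 0..15.
static const int kBayer4[4][4] = {
    { 0,  8,  2, 10},
    {12,  4, 14,  6},
    { 3, 11,  1,  9},
    {15,  7, 13,  5}
};

// Quantize v (nominally in [0, 1]) to 8 bits. Before rounding, a
// position-dependent offset strictly between -0.5 and +0.5 code values is
// added, so a value that falls between two codes is rendered as a spatial
// mix of both instead of a hard band edge.
// Assumption: x and y are non-negative pixel coordinates.
std::uint8_t ditherTo8Bit(float v, int x, int y) {
    const float offset = (kBayer4[y & 3][x & 3] + 0.5f) / 16.0f - 0.5f;
    float scaled = v * 255.0f + offset;
    scaled = std::fmin(std::fmax(scaled, 0.0f), 255.0f);
    return static_cast<std::uint8_t>(std::floor(scaled + 0.5f));
}
```

A value halfway between codes 100 and 101 comes out as 100 on half of the pixels of each 4x4 tile and 101 on the other half, which is what hides the banding in smooth gradients.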

i could never read a lot of meaning into these rendering intents. but i think i’m doing what would be called “absolute colorimetric” (i.e. don’t touch the numbers if they don’t clip).

To be fair, @ggbutcher reported the 260 ms for a tablet using CPU processing…
Would you get the 1ms speed on that tablet using gpu processing?

i don’t know anything about tablets. but a full screen fragment shader doing pretty much nothing is really not something you pay a lot for. definitely not something more than vsync.

I think the consideration here is CPU cores vs GPU; the way OpenMP is set up to dynamically use CPU cores for threads, I can program a, say, matrix operation to use whatever cores are available on the machine upon which the program is executing, no user configuration required. And, the LittleCMS transform routines are organized so that a programmer can apportion their work among the available cores with almost no consideration toward that apportionment, thanks to OpenMP. On the tablet, my program just uses the two available cores to run as many threads to do the display transform, and when I take the same program to my imminent 12-core Ryzen beast, it’ll do the same apportionment without any additional work.
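A minimal sketch of the "whatever cores are available" point, assuming interleaved RGB floats in one buffer. The function name applyMatrixToImage is hypothetical and this is not rawproc's code. The only OpenMP-specific line is the pragma; built without OpenMP support, the same loop simply runs serially.

```cpp
#include <cstddef>
#include <vector>

// Apply a 3x3 matrix to every pixel of an interleaved RGB float image.
// The row loop is what gets divided among threads.
void applyMatrixToImage(std::vector<float>& rgb, std::size_t width,
                        std::size_t height, const float m[3][3]) {
    // The OpenMP runtime decides at run time how many threads to use,
    // by default based on the cores it finds on the machine.
#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (long y = 0; y < static_cast<long>(height); ++y) {
        float* row = rgb.data() + static_cast<std::size_t>(y) * width * 3;
        for (std::size_t x = 0; x < width; ++x) {
            float* p = row + x * 3;
            const float r = p[0], g = p[1], b = p[2];
            p[0] = m[0][0] * r + m[0][1] * g + m[0][2] * b;
            p[1] = m[1][0] * r + m[1][1] * g + m[1][2] * b;
            p[2] = m[2][0] * r + m[2][1] * g + m[2][2] * b;
        }
    }
}
```

Rows are independent, so no locking is needed; the same binary spreads the loop over two cores on the tablet or twelve on a bigger machine.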

I don’t know anything about GPU programming, so I can’t map the same implications there…

nice! congrats on your new machine in this case. sounds like quite the upgrade, should be fun!

i guess having worked on darktable for a decade i’m a bit sensitive to code complexity, 3rd party dependencies, and also to performance “death by 1000 cuts”. the thought to include an external library to perform a 3x3 matrix multiply does not appeal to me.

and yes, openmp is great. i moved on to do pthread pools instead, because i couldn’t figure out a way to pin my threads to cpu cores (so the caches wouldn’t thrash). also there were a few implementation dependent issues that made deployment over different platforms/compilers a little hard at times. turned out whatever features i need from a thread pool is ~300 lines of code, i think i can take this hit.

as to GPU programming. it’s a little messier to setup, but then you’ll get similar parallel processing as you would on CPU if you do a lot of wide SIMD (32x for nvidia). together with a few execution units on the device this will give you a number of parallel threads in the many thousands (exact number varies a bit with configuration).

i don’t mind writing CPU SIMD in SSE intrinsics, but i have to say the programming model for the shaders is a lot simpler to do, see for instance this colour/dithering shader:

(it’s not very compact or elegant, i should rewrite it, but you get the idea). it’s essentially written single-thread single-simd-lane, and the compilation will do the rest.

Probably need to say that I use LittleCMS for all of my color transforms as of this date. That includes the colorspace tool, with which I can insert one or more color transforms anywhere in the toolchain I desire, as well as the output/export transforms for saving to image files. The colorspace tool also can assign profiles to the internal image; my current proof processing does that as the very first step, so that the camera profile is available for any subsequent tool to display if it is selected to do so. That doesn’t work well for earlier tools, especially for pre-demosaic operations, but it’s been instructive to see the display transform “get better” as I click through tools in the chain from first to last.

I struggle in a different way with third-party dependencies; I know I can’t learn it all, so I’ve chosen to rely on libraw, littlecms, and now librtprocess for key functionality in image processing. That has let me concentrate on fleshing out a complete toolset for my uses. I’m wrestling right now with exiv2; I’ve already hand-decoded the metadata of all of my input formats, and it’s a bigger pain than one might think considering all the aspects of converting to a different metadata structure. I still may decide not to include exiv2 for the upcoming 1.0 rawproc.

Being a distributed computing geek of sorts, I want to eventually figure out GPU programming. Given that I’ve #pragma omp -ed just about every place in rawproc that makes some sense, the imperative to do that is not high right now. And then, I’m going to want to make the user burden for using such as small as possible, so that’ll take some head-scratching…

oh that sounds really interesting. do you have a pointer to your code for this?

right. as of today using GPUs too much has quite some setup cost for users. which is why we do this silly dance and dlopen() all the opencl callbacks manually in darktable, so you can run it even without linking to opencl. i’m kindof hoping that standard opengl/vulkan features with not too esoteric requirements on beta drivers will be widely available in the future (maybe not on macintosh, but well). i mean even smartphones have it nowadays.