CPU SSE support.

RawTherapee is said to support “modern” SSE instructions, but this is very vague. Could you expand on the CPU features that are supported? Which instruction sets are supported, and which is the most up-to-date one (the original SSE dates back to 1999)? Are we talking about 128-bit vectors, 256-bit ones, …? Are you working on something new in that regard?
Thank you!

Currently it uses 128-bit vectors. You need at least SSE2 to get the advantage of the vectorized code in RT. Generic x86_64 builds won’t use any instructions beyond SSE2. Native builds (e.g. if you compile RT for your own machine) support up to SSE4.1 and also FMA for the hand-written vectorized code, while the code auto-vectorized by the compiler can use every feature of your CPU that the compiler supports.
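
To illustrate what “128-bit vectors” means in practice, here is a minimal sketch (not taken from the RT sources, the function name is hypothetical) of an SSE-vectorized loop: each vector instruction processes four single-precision floats at once.

    #include <emmintrin.h>  // SSE/SSE2 intrinsics

    // Hypothetical example: add two float arrays, four elements per instruction.
    void addFloats(const float* a, const float* b, float* out, int n)
    {
        int i = 0;
        for (; i + 3 < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);             // load 4 floats (128 bits)
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  // 4 additions in one instruction
        }
        for (; i < n; ++i) {                             // scalar remainder
            out[i] = a[i] + b[i];
        }
    }

With AVX/AVX2 (256-bit vectors) the same loop could process eight floats per instruction, but as stated above the hand-written code in RT currently targets 128-bit vectors.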

2 Likes

Is it enough to compile with “for my own machine” to get SSE4.1, or do I have to set it explicitly?

-DPROC_TARGET_NUMBER="2" in the cmake command is enough.
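
For reference, a native build configuration might then look like this (a sketch, assuming the usual out-of-source build directory; adjust paths and other options to your setup):

    cmake -DCMAKE_BUILD_TYPE="release" -DPROC_TARGET_NUMBER="2" ..
    make

With -DPROC_TARGET_NUMBER="2" the build targets the machine you compile on, so the resulting binary may not run on older CPUs.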

Edit: Just to give you a number: On my FX8350 a native build is ~10% faster for amaze demosaic than a generic build.

2 Likes

hi Ingo, do you think there’s a benefit to a native build with an i7-6700K please?

I think there’s always a benefit in making your own native RT build. Your CPU supports a lot of features which cannot be used in generic builds but can be used in native builds. It also depends on the compiler you use.
I just tried native vs. generic on my machine for the following tools on a 36 MP image (time measured for full processing via the queue, RT built using GCC 6.3.0):

  1. Applying a Hald CLUT: generic: 320 ms, native: 250 ms
  2. Shadows/Highlights with Sharp mask enabled: generic: 1350 ms, native: 1080 ms

Thank you all.
Why can’t those optimizations be made available in the generic releases?

Generic builds for a given architecture must be able to run on all machines with that architecture. E.g. generic x86_64 builds must be able to run on all x86_64 machines, and AFAIK the minimal instruction set supported by all x86_64 CPUs is SSE2.

1 Like

Isn’t this compiler-dependent? I was told that the Intel and MS compilers compile for various cases and use the most efficient one at execution time. Compiled output files are much larger, though. Is this true?
RONC

I can’t tell about the Intel and MS compilers. Maybe they generate code for various cases and use a CPU dispatcher to decide at runtime which code to use. However, for the hand-written vectorized code in RT (and there’s a lot of it) I’m quite sure the Intel and MS compilers can’t use CPU dispatchers.
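
For readers wondering what such a runtime dispatcher looks like conceptually, here is a minimal sketch with hypothetical function names (not RT code); GCC and Clang provide __builtin_cpu_supports() for the feature check.

    #include <cstdio>

    // Two hypothetical implementations of the same routine.
    static void process_sse2()  { std::puts("SSE2 path"); }
    static void process_sse41() { std::puts("SSE4.1 path"); }

    int main()
    {
        // Query the CPU the program is actually running on.
        if (__builtin_cpu_supports("sse4.1")) {
            process_sse41();
        } else {
            process_sse2();
        }
    }

This works well when a whole routine can be swapped out, but as explained below it doesn’t pay off when the SSE4-vs-SSE2 difference is a single instruction deep inside a hot loop.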

Sorry for the delay, I’m fixing the PC. So, the generic build is the only one guaranteed to work flawlessly on any machine, while you can compile the program yourself so that the compiler generates more optimized, CPU-specific code.
Is it not regarded as important to add more optimizations to the generic build? Or is it just too costly at this stage?
I apologize for being ignorant on programming matters.
Thank you all.

No, the generic build for an architecture like x86_64 will work on any machine with that architecture, but not on machines with a different architecture (e.g. 32-bit x86). The generic x86_64 builds will use the SSE2 code in RT, but not the SSE4 code, for example.

E.g. there are some cases in RT where the SSE4 code needs one instruction while the SSE2 code needs three. The decision whether to use the one SSE4 instruction or the three SSE2 instructions has to be made at compile time. Deciding at runtime would either be slower than the gain from using the single SSE4 instruction, or would bloat the code a lot.
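
A concrete instance of this one-vs-three pattern is the vector select, sketched below with plain intrinsics (an illustration, not the actual RT code; presumably this is the kind of helper the vself() mentioned later wraps). SSE4.1 has a single blend instruction, while SSE2 needs an and/andnot/or sequence.

    #include <immintrin.h>  // x86 intrinsics; build with -msse4.1 for select_sse41()

    // Return a[i] where the mask lane is set, b[i] otherwise.

    // SSE4.1: one instruction (blendvps).
    static inline __m128 select_sse41(__m128 mask, __m128 a, __m128 b)
    {
        return _mm_blendv_ps(b, a, mask);
    }

    // SSE2: three instructions (and, andnot, or).
    static inline __m128 select_sse2(__m128 mask, __m128 a, __m128 b)
    {
        return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
    }

Because such selects sit deep inside hot loops, wrapping every one of them in a runtime check would cost more than the instruction it saves.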

Ingo, GCC 4.8+ (apparently with some improvements in v6) makes this possible with Function Multiversioning. LWN has an article on that matter. I don’t know whether it’s also supported by Clang. Have you already experimented with it?
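
For readers who haven’t seen Function Multiversioning: you define the same function several times with different target attributes, and GCC emits a resolver that picks the appropriate version at program start. A minimal sketch with a hypothetical function (not RT code):

    // GCC 4.8+ (C++). The compiler generates the runtime dispatch automatically.

    __attribute__((target("default")))
    float sumOfSquares(const float* v, int n)
    {
        float s = 0.f;
        for (int i = 0; i < n; ++i) {
            s += v[i] * v[i];
        }
        return s;
    }

    __attribute__((target("sse4.1")))
    float sumOfSquares(const float* v, int n)
    {
        // Same source; the compiler may vectorize this version with SSE4.1.
        float s = 0.f;
        for (int i = 0; i < n; ++i) {
            s += v[i] * v[i];
        }
        return s;
    }

Call sites simply call sumOfSquares(); the resolver runs once, which is why, as the following replies note, FMV only pays off for larger blocks of code rather than for single scattered instructions.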

@floessie
I haven’t experimented with Function Multiversioning yet.

But I don’t think it makes sense for e.g. this case which is heavily used in amaze demosaic.

Of course it would be possible to make multiple versions of e.g. the amaze demosaic, but that’s not really maintainable.

I see. No, that wouldn’t work.

Absolutely. If the use of vself() wasn’t so scattered, one could use FMV for larger blocks, but that’s not the case.

I wonder, would it make sense to offer an additional “optimized” build alongside the generic x64 one that makes use of the most common extensions available in newer CPUs, e.g. SSE4? I guess most people using RT will have one of the newer i5/i7 CPU varieties, so it seems a bit strange to “cripple” the main build for the few people with very old CPUs. Or would the speed difference be very minor anyway?

For Windows, I can build for the GCC “skylake” architecture (using -march=native on my machine). I will upload a build to test today.
edit:
I uploaded RawTherapee_dev_5.1-115-g3281332b_WinVista_64_skylake.zip
at https://drive.google.com/open?id=0B2q9OrgyDEfPMVltRXRMQ1JoMTQ built with -march=native on my machine

For those interested, the following flags are generated by GCC:

-march=skylake -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mbmi2 -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mrtm -mhle -mrdrnd -mf16c -mfsgsbase -mrdseed -mprfchw -madx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mclflushopt -mxsavec -mxsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl -mno-avx512ifma -mno-avx512vbmi -mno-clwb -mno-mwaitx -mno-clzero -mno-pku --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=8192 -mtune=skylake

You will find the corresponding generic build RawTherapee_dev_5.1-115-g3281332b_WinVista_64.zip at the usual place: https://drive.google.com/open?id=0B2q9OrgyDEfPS2FpdDAtMVI1RG8

@sTi @gaaned92
I like this, luckily I’m Skylake, so will be installing this optimised build.

If anyone wonders whether their chip is Skylake or not, look here

will be more generic and also gives a good speedup for Amaze, Shadows/Highlights, CIECAM02 and some other tools…

1 Like

@heckflosse, hi Ingo, I finally got round to building RT to see if a native compile goes faster on my i7 6700K. And it goes much faster, it seems! Using the same raw file and PP3 (with plenty of tool usage) under Ubuntu 16.10 I got these results:

  1. RT 5.1 via the usual PPA “dhor”: 21 seconds from hitting OK on Save (JPEG) to RT saying “Ready”
  2. RT 5.2 Dev compiled today: 12 seconds

Does anyone know if any speed-ups have been coded into 5.2 since 5.1 please? If not it seems like a great speed increase.

Here are the 5.2 details in case anything is odd…
Version: 5.2-5-g10822f5
Branch: 5.2-5-g10822f5
Commit: 10822f5
Commit date: 2017-07-27
Compiler: cc 6.2.0
Processor: x86_64
System: Linux
Bit depth: 64 bits
Gtkmm: V3.20.1
Build type: release
Build flags: -std=c++11 -march=native -Werror=unused-label -fopenmp -Werror=unknown-pragmas -Wall -Wno-unused-result -Wno-deprecated-declarations -O3 -DNDEBUG
Link flags: -march=native
OpenMP support: ON
MMAP support: ON

I got some error messages when I did sudo apt update - do these matter? -
W: GPG error: http://download.opensuse.org/repositories/home:/rawtherapee/xUbuntu_16.10 Release: The following signatures couldn’t be verified because the public key is not available: NO_PUBKEY AAC60BAB587A9447
E: The repository ‘http://download.opensuse.org/repositories/home:/rawtherapee/xUbuntu_16.10 Release’ is not signed.
N: Updating from such a repository can’t be done securely, and is therefore disabled by default.

The “make” took the CPU up to 74 degrees C, and I didn’t like that! I will limit the number of cores next time!
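
On that note: you can cap the number of parallel compile jobs to keep temperatures down, e.g. (a lower job count builds more slowly but works the CPU less hard):

    make -j2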

Thanks RT people for the continuing work on this great piece of software.