CPU SSE support.

(Dm30) #1

RawTherapee is said to support “modern” SSE instructions, but this is very vague. Could you expand on the CPU features supported? Which instruction sets are supported? Which is the most recent one (the original SSE dates back to 1999)? Are we talking about 128-bit vectors, 256-bit ones …? Are you working on something new in that regard?
Thank you!

(Ingo Weyrich) #2

Currently it uses 128-bit vectors. You need at least SSE2 to get the advantages of vectorized code in RT. Generic x86_64 builds of RT won’t use any instructions newer than SSE2. Native builds (e.g. if you compile RT for your own machine) support up to SSE4.1 and also FMA in the hand-written vectorized code; the code auto-vectorized by the compiler can use every feature of your CPU that the compiler supports.
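
To make the difference between a generic and a native build a bit more concrete, here is a small sketch (not taken from RT's source) that reports which SIMD feature macros the compiler defined for the current build. The macros `__SSE2__`, `__SSE4_1__`, `__FMA__` and `__AVX2__` are predefined by GCC and Clang depending on the `-march`/`-m...` flags; the function name `compiledSimdFeatures` is made up for this example.

```cpp
#include <string>
#include <vector>

// Lists the SIMD feature macros that were defined when this file was compiled.
// Hypothetical helper for illustration only; it is not part of RawTherapee.
std::vector<std::string> compiledSimdFeatures()
{
    std::vector<std::string> features;
#ifdef __SSE2__
    features.push_back("SSE2");
#endif
#ifdef __SSE4_1__
    features.push_back("SSE4.1");
#endif
#ifdef __FMA__
    features.push_back("FMA");
#endif
#ifdef __AVX2__
    features.push_back("AVX2");
#endif
    return features;
}
```

Compiled for generic x86_64 this would typically list only SSE2; compiled with `-march=native` on a recent CPU it should list more entries.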


Is it enough to compile “for my own machine” to get SSE4.1, or do I have to set it explicitly?

(Ingo Weyrich) #4

-DPROC_TARGET_NUMBER="2" in cmake command is enough

Edit: Just to give you a number: On my FX8350 a native build is ~10% faster for amaze demosaic than a generic build.

(Andrew) #5

hi Ingo, do you think there’s a benefit to a native build with an i7-6700K please?

(Ingo Weyrich) #6

I think there’s always a benefit to making your own native RT build. Your CPU supports a lot of features which cannot be used in generic builds, but can be used in native builds. But it also depends on the compiler you use.
I just tried native vs. generic on my machine for the following tools on a 36 Mp image (time measured in full processing by queue, RT built using gcc 6.3.0):

  1. applying hald clut : generic: 320 ms, native: 250 ms
  2. Shadows/Highlights Sharp mask enabled: generic: 1350 ms, native: 1080 ms

(Dm30) #7

Thank you all.
Why can’t those optimizations be available in generic releases?

(Ingo Weyrich) #8

Generic builds for a particular architecture must be able to run on all machines with that architecture. E.g. generic x86/64 builds must be able to run on all x86/64 machines. The minimal instruction set supported on all x86/64 machines is SSE2, afaik.

(Ronald E Chambers) #9

Isn’t this compiler dependent? I was told that the Intel and MS compilers compile for various cases and use the most efficient one at execution. The compiled output files are much larger, though. Is this true?

(Ingo Weyrich) #10

I can’t tell about Intel and MS compilers. Maybe they generate code for various cases and use a cpu dispatcher to decide at runtime, which code to use. However, for the hand written vectorized code in RT (and there’s a lot of it) I’m quite sure Intel and MS compilers can’t use cpu dispatchers.
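
For readers unfamiliar with the term, a CPU dispatcher can be sketched roughly like this. This is only an illustration under assumptions: `sumGeneric`, `sumSse41` and `pickSum` are hypothetical names, and the code relies on the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports` plus the `target` function attribute, which are only available on x86 with those compilers (hence the guard).

```cpp
#include <cstddef>

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

using SumFn = float (*)(const float*, std::size_t);

// Baseline variant, compiled with the default (generic) instruction set.
static float sumGeneric(const float* v, std::size_t n)
{
    float s = 0.f;
    for (std::size_t i = 0; i < n; ++i) {
        s += v[i];
    }
    return s;
}

#if HAVE_X86_DISPATCH
// Same source, but the compiler is allowed to use SSE4.1 inside this function.
__attribute__((target("sse4.1")))
static float sumSse41(const float* v, std::size_t n)
{
    float s = 0.f;
    for (std::size_t i = 0; i < n; ++i) {
        s += v[i];
    }
    return s;
}
#endif

// Chooses a variant once, based on what the running CPU reports.
SumFn pickSum()
{
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.1")) {
        return sumSse41;
    }
#endif
    return sumGeneric;
}
```

The caller stores the returned pointer and calls through it, so the check happens once per function rather than once per instruction. That works for whole functions, but not for code where the choice is baked in with the preprocessor, which is the situation with hand-written intrinsics.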

(Dm30) #11

Sorry for the delay. I’m fixing the PC. So, the generic build is the only one guaranteed to work flawlessly with any machine, while you can compile the program yourself so that the compiler produces more optimized, CPU-specific code.
Is it not regarded as important to add more optimizations to the generic build? Is it just too costly at this stage?
I apologize for being ignorant on programming matters.
Thank you all.

(Ingo Weyrich) #12

No, the generic build for an architecture like x86/64 will work on any machine with this architecture, but not on machines with a different architecture (e.g. x86/32). The generic x86/64 builds will use the SSE(2) code in RT, but not the SSE4 code, for example.

E.g. there are some cases in RT where the SSE4 code needs one instruction where the SSE2 code needs three. This decision (whether to use the one SSE4 instruction or the three SSE2 instructions) has to be made at compile time. Deciding at runtime would either cost more than the single SSE4 instruction gains, or bloat the code a lot.
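
A typical instance of "one SSE4 instruction versus three SSE2 instructions" is a per-lane select. The sketch below is an assumption about what such a helper can look like, not a copy of RT's code: `selectLanes` and `clampNegativesToZero` are hypothetical names, while the intrinsics themselves (`_mm_blendv_ps`, `_mm_and_ps`, `_mm_andnot_ps`, `_mm_or_ps`) are the standard Intel ones.

```cpp
#include <emmintrin.h>   // SSE2 intrinsics
#ifdef __SSE4_1__
#include <smmintrin.h>   // SSE4.1 intrinsics
#endif

// Per-lane select: returns a where the mask lane is all ones, b elsewhere.
// The choice between the two code paths is fixed at compile time.
static inline __m128 selectLanes(__m128 mask, __m128 a, __m128 b)
{
#ifdef __SSE4_1__
    // One instruction: blendv takes a where the mask's sign bit is set.
    return _mm_blendv_ps(b, a, mask);
#else
    // Three instructions: (mask & a) | (~mask & b).
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
#endif
}

// Example use: replaces negative values by zero in a group of four floats.
void clampNegativesToZero(float* data)
{
    const __m128 zero = _mm_setzero_ps();
    const __m128 v = _mm_loadu_ps(data);
    const __m128 isNegative = _mm_cmplt_ps(v, zero);  // all ones where v < 0
    _mm_storeu_ps(data, selectLanes(isNegative, zero, v));
}
```

Because the helper is inlined into tight loops, wrapping each call in a runtime check would add a branch around a single instruction, which is the cost described above.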

(Flössie) #13

Ingo, GCC 4.8+ (apparently with some improvements in v6) makes this possible with Function Multiversioning. LWN has an article on that matter. Don’t know if it’s also supported with Clang. Have you already experimented with it?
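
For reference, a minimal sketch of what Function Multiversioning can look like with GCC's `target_clones` attribute (available since GCC 6). The function `scaleInPlace` and the macro `FMV_CLONES` are made up for this example, and the guard is an assumption on my side: the attribute needs ifunc support, so it is only enabled for GCC on x86_64 Linux here and the function compiles as a plain one elsewhere.

```cpp
#include <cstddef>

#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && defined(__linux__)
// GCC emits one clone per listed target plus a resolver that picks one at load time.
#define FMV_CLONES __attribute__((target_clones("sse4.1", "default")))
#else
#define FMV_CLONES
#endif

// Multiplies every element by factor; the compiler may vectorize each clone
// differently depending on the instruction set it is built for.
FMV_CLONES
void scaleInPlace(float* data, std::size_t n, float factor)
{
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}
```

The call site stays the same (`scaleInPlace(buffer, size, 2.f)`); the selection happens once when the program is loaded. This applies per function, so it helps most when the performance-critical code sits in a few larger functions.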

(Ingo Weyrich) #14

I did not yet experiment with Function Multiversioning.

But I don’t think it makes sense for e.g. this case which is heavily used in amaze demosaic.

Of course it would be possible to make multiple versions of e.g. amaze demosaic, but that’s not really maintainable.

(Flössie) #15

I see. No, that wouldn’t work.

Absolutely. If the use of vself() wasn’t so scattered, one could use FMV for larger blocks, but that’s not the case.

(Stefan) #16

I wonder, would it make sense to offer an additional “optimized” build alongside the generic x64 one that can make use of the most common things like e.g. SSE4 available in newer CPUs? I guess most people using RT will have one of the newer i5/i7 CPU varieties, so it seems a bit strange to “cripple” the main build for the few people with very old CPUs. Or would the speed difference be very minor anyway?


For Windows, I can build for the GCC “skylake” architecture (using -march=native on my machine). I will upload today to test.
I uploaded RawTherapee_dev_5.1-115-g3281332b_WinVista_64_skylake.zip at https://drive.google.com/open?id=0B2q9OrgyDEfPMVltRXRMQ1JoMTQ built with -march=native on my machine.

For those interested, the following flags are generated by GCC:

-march=skylake -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mbmi2 -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mrtm -mhle -mrdrnd -mf16c -mfsgsbase -mrdseed -mprfchw -madx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mclflushopt -mxsavec -mxsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl -mno-avx512ifma -mno-avx512vbmi -mno-clwb -mno-mwaitx -mno-clzero -mno-pku --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=8192 -mtune=skylake

You will find the corresponding generic build RawTherapee_dev_5.1-115-g3281332b_WinVista_64.zip at the usual place: https://drive.google.com/open?id=0B2q9OrgyDEfPS2FpdDAtMVI1RG8


@sTi @gaaned92
I like this, luckily I’m Skylake, so will be installing this optimised build.

If anyone wonders whether their chip is Skylake or not, look here

(Ingo Weyrich) #19

will be more generic and also gives a good speedup for Amaze, Shadows/Highlights, ciecam02 and some other tools…

(Andrew) #20

@heckflosse, hi Ingo, I finally got round to building RT to see if a native compile goes faster on my i7 6700K. And it goes much faster, it seems! Using the same raw and PP3 (with plenty of tool usage) under Ubuntu 16.10 I got these results:

  1. RT 5.1 via the usual PPA “dhor” - 21 seconds from hitting OK on Save (Jpeg) to RT saying “Ready”
  2. RT 5.2 Dev compiled today - 12 seconds.

Does anyone know if any speed-ups have been coded into 5.2 since 5.1 please? If not it seems like a great speed increase.

Here are the 5.2 details in case anything is odd…
Version: 5.2-5-g10822f5
Branch: 5.2-5-g10822f5
Commit: 10822f5
Commit date: 2017-07-27
Compiler: cc 6.2.0
Processor: x86_64
System: Linux
Bit depth: 64 bits
Gtkmm: V3.20.1
Build type: release
Build flags: -std=c++11 -march=native -Werror=unused-label -fopenmp -Werror=unknown-pragmas -Wall -Wno-unused-result -Wno-deprecated-declarations -O3 -DNDEBUG
Link flags: -march=native
OpenMP support: ON
MMAP support: ON

I got some error messages when I did sudo apt update - do these matter? -
W: GPG error: http://download.opensuse.org/repositories/home:/rawtherapee/xUbuntu_16.10 Release: The following signatures couldn’t be verified because the public key is not available: NO_PUBKEY AAC60BAB587A9447
E: The repository ‘http://download.opensuse.org/repositories/home:/rawtherapee/xUbuntu_16.10 Release’ is not signed.
N: Updating from such a repository can’t be done securely, and is therefore disabled by default.

The “make” took the CPU up to 74 degrees C; I didn’t like that! I will limit the number of cores next time!

Thanks RT people for the continuing work on this great piece of software.