The real challenge is how to optimize a binary and still make it portable for distribution (be it linux distros or windows/mac installer).
The challenge is (ordered by priority from the dev perspective):
1. optimize the code to perform well on a single-core scalar arch
2. vectorize the code
3. parallelize the code
4. let the builders do the remaining parts
yeah see … the last part does not scale … I have that for darktable and it becomes a nice matrix of
For darktable @aurelienpierre was investigating target clones for the important functions in this pull request. It still needs some work to be portable. Another option might be a kind of fat binary (kinda target clones for the whole binary), but that would probably increase the binary size a lot.
And once you have all the target clone stuff working, you need to identify all the functions that would benefit from it, and it might sometimes surprise you which functions can benefit from SIMD. Did you know that some string functions in glibc use SSE/AVX?
Target clones is probably something we should also consider for librtprocess.
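To make the target clones idea concrete, here is a minimal sketch, assuming GCC on x86-64 Linux with glibc. The function scale_pixels and the MULTIVERSIONED macro are hypothetical illustrations, not code from darktable or librtprocess. With the attribute, the compiler emits one copy of the function per listed target and a resolver picks the best one for the CPU at load time; on other platforms the macro expands to nothing and a single plain build is used.

```c
#include <assert.h>
#include <stddef.h>

/* target_clones is a GCC extension that relies on ifunc support (glibc).
   Anywhere else, fall back to a single, generically compiled version. */
#if defined(__x86_64__) && defined(__GLIBC__) && defined(__GNUC__) && !defined(__clang__)
#define MULTIVERSIONED __attribute__((target_clones("default", "sse4.2", "avx2")))
#else
#define MULTIVERSIONED
#endif

/* Hypothetical hot loop: multiply every pixel value by a gain factor.
   The body is plain scalar C; each clone is auto-vectorized for its own
   target when optimization is enabled. */
MULTIVERSIONED
void scale_pixels(float *restrict out, const float *restrict in, size_t n, float gain)
{
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * gain;
}
```

The caller just calls scale_pixels() as usual; the binary stays portable because the "default" clone is always present. The price is one extra copy of the function per target, which is why it only pays off for the functions that actually benefit.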
What’s wrong with my way?
- optimize the code to perform well on a single core scalar arch
means the math is optimized
That’s the base for the next steps
I am only referring to your packager part … should have probably quoted that. of course it is the base … but the 4th step is quite a fun challenge as well
Thanks everyone for the interesting comments!
The discussion on what the future could be is very interesting. For example, I also would like to know what are the optimizations one could do at compile stage for specific cpus/gpus; I have compiled DT myself even if I’m not a programmer so maybe a short note/guidelines to be included in the online manual or even here could be interesting for me (and I guess for many others; certainly not for the majority of the users that will be however happy with the official releases).
About the state of things in DT development (what @aurelienpierre describes as “anarchic”): I guess unless somebody steps up and all other developers agree on this, we may have to be happy with the current situation. It certainly is an incredible success regardless to have Darktable in its current state, which gives people like me (Linux and open source supporters) a viable alternative to commercial apps.
About the present situation: I have rewatched a few online tutorials, e.g. the nice visual-only videos by @s7habo, and I have noticed that the way DT reacts interactively to user input is perhaps similar to my experience (as opposed to the fluidity that Lightroom has, as I mentioned at the beginning). I may have to do some short screencasts showing my personal experience, just to understand if this is the reality that all other users live with.
I have also re-discovered some benchmarking that maybe we could do all together as a community and gather the results; it is mentioned here with some exotic machinery that I will not even start to compare my laptop to, and here. These are the files (raw and processing history):
These are the commands to run DT from the command line with and without OpenCL:
$ darktable-cli bench.SRW test.jpg --core -d perf -d opencl
$ darktable-cli bench.SRW test.jpg --core --disable-opencl -d perf -d opencl
You will get some stats and take the last line as indicative of the time spent to do the entire processing, e.g.:
15,684476 [dev_process_export] pixel pipeline processing took 12,885 secs
I’ve run them a few times, then averaged the results, and what I get on my laptop (Dell XPS-15 with i7-7700HQ @ 2.8 GHz, GeForce GTX 1050, 16 GB RAM, 512 GB SSD) is:
~13 secs with opencl
~80 secs cpu only
I have also played with OpenCL parameters following some advice found here but can’t say I have noticed much difference. I’m not sure if this benchmark tests the fluidity of the user experience, which is what I’m after, but if I can compare my results to those of similar machines I will at least get to know if my laptop is behaving “as expected” or not.
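For anyone who wants to collect several runs without copying numbers by hand, here is a rough sketch of pulling the seconds out of the "pixel pipeline processing took" line and averaging them. The function names pipeline_seconds and mean_seconds are made up for this example. Note that the quoted output uses a decimal comma (12,885), which depends on the locale, so the sketch accepts both comma and dot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Extract the wall-clock seconds from a "-d perf" line such as
   "... pixel pipeline processing took 12,885 secs".
   Returns -1.0 if the line does not contain that message. */
double pipeline_seconds(const char *line)
{
    static const char marker[] = "pixel pipeline processing took ";
    assert(line != NULL);

    const char *p = strstr(line, marker);
    if (p == NULL)
        return -1.0;
    p += sizeof marker - 1;

    /* Copy the digits, turning a decimal comma into a dot so that
       strtod() in the default "C" locale can read the number. */
    char num[32];
    size_t i = 0;
    while (i < sizeof num - 1 &&
           (*p == ',' || *p == '.' || (*p >= '0' && *p <= '9'))) {
        num[i++] = (*p == ',') ? '.' : *p;
        p++;
    }
    num[i] = '\0';
    if (i == 0)
        return -1.0;
    return strtod(num, NULL);
}

/* Arithmetic mean of n timings, as used for "run a few times and average". */
double mean_seconds(const double *runs, size_t n)
{
    assert(runs != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += runs[i];
    return sum / (double)n;
}
```

Feed pipeline_seconds() each line of the darktable-cli output, keep the values that are not negative, and pass them to mean_seconds(). Only the first number after "took" is read, so the CPU time in parentheses is ignored.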
I was on the verge of publishing some clockings
using different CL settings in darktable (using one/two
gfx) after having compiled it using different CFLAGS –
but since you just mailed that bench file, I simply
must check those figures first.
I’ll be back…
That was an interesting test, @aadm. Thank you. Your test method is much better than mine.
Here are my clockings as per Alessandro’s suggestions, above:
With compile parameters -O2 -march=native:
~ 3 secs with opencl
~ 12 secs cpu only
With compile parameters -O3 -march=znver1:
~ 3 secs with opencl
~ 12 secs cpu only
Leaving out my special compile parameters:
~ 3 secs with opencl
~ 12 secs cpu only (just 8 hundredths of a second slower)
This, of course, was a disappointment, since I had hoped that Ryzen-specific parameters would have had more impact.
Ryzen 7 2700X; GTX-1660; 16 GB RAM;
if your machine is a zen1 platform then your march=native is znver1.
you can see that with a simple helloworld.c and then using
gcc -v -march=native -o helloworld helloworld.c
you will find the march gcc then picked internally in the output.
as a distro packager we can not use that. we can only use
-DBINARY_PACKAGE_BUILD=1 which will give us the generic march. that would be a proper comparison then for you. make sure to use make clean between builds.
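A quick way to see what a given set of flags actually unlocks is to ask the compiler itself. The sketch below (compiled_simd_level and report_simd_level are names invented for this example) only checks the predefined macros that GCC and Clang set when an instruction set is enabled at compile time. It says nothing about what the CPU supports at runtime.

```c
#include <assert.h>
#include <stdio.h>

/* Return the highest x86 SIMD level the compiler was allowed to use for
   this translation unit, based on its predefined macros. */
const char *compiled_simd_level(void)
{
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE4_2__)
    return "SSE4.2";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "none detected (scalar or non-x86)";
#endif
}

/* Print the level to the given stream, e.g. stdout. */
void report_simd_level(FILE *out)
{
    assert(out != NULL);
    fprintf(out, "compiled for: %s\n", compiled_simd_level());
}
```

Call report_simd_level(stdout) from a small main(), then build it once with gcc -O2 -march=native and once with plain gcc -O2 and compare the two outputs. If both builds of darktable give the same timings even though the reported levels differ, the hot paths are probably not limited by the instruction set the compiler picked.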
No, it is a zen2.
CPU only: 32,8 s
GPU: 6,6 s
CPU: Topology: Quad Core model: Intel Core i7-2600 bits: 64 type: MT MCP L2 cache: 8192 KiB Speed: 2236 MHz min/max: 1600/3800 MHz
GPU: Device-1: NVIDIA GP106 [GeForce GTX 1060 6GB] driver: nvidia v: 390.87 Display: x11 server: X.Org 1.19.2 driver: nvidia resolution: 1920x1200~60Hz OpenGL: renderer: GeForce GTX 1060 6GB/PCIe/SSE2 v: 4.6.0 NVIDIA 390.87
Memory: RAM: total: 15.63 GiB @1333
Kernel: 4.19.0-0.bpo.1-amd64 x86_64
My base system (CPU, MB, RAM) is from early 2012. CPU i7-2600K was released in 2011. Your i7-7700 is from 2017. Quite a heavy difference between desktop and mobile CPU here.
CPU only: 15.4s
GPU: 3.8s (only one of two GPUs utilized)
Intel® Core™ i9-9900K CPU @ 3.60GHz
32GB ram DDR4-2133 (very little improvement if I run at X.M.P. II )
NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1) with nvidia-drivers-418.43
- you have two GPUs in your PC?
- I assume GTX1660, as you wrote above, was a typo and is GTX1060?
- During real processing both of my GPUs are definitely working, but for this bench test I couldn’t make that happen. Do you have any idea?
- On the GPU our results should be closer, I guess, but they are not, even though mine is a G1 with factory OC. Yours?
- According to phoronix.com the Core i9 should be faster than the Ryzen 2700, but it seems it is not, or do you use OC? (Anyway, I thought I needed the single-core performance as well, which is why I decided on Intel, but this is off-topic now.)
No, it is not a typo.
Additional tech data PM’d.
Yes, there is no parallel export.
Did you build dt on your own? Default for
--build-type=Release is faster?
@Claes: thanks a lot, replied PM-ish
@pk5dark: yes, currently I am on git-master and just use the build.sh (actually from within a script, that does the update, build and install all in one). What does this “=Release” actually do?
I am on Gentoo and there you also always build from source. Any chance, if I want and if I can actually go back to official release, to implement this via USE flags?
Thanks in advance!
- CPU: 87,830 secs (320,479 CPU)
- OpenCL: 9,301 secs (14,804 CPU)
AMD® A8-7600 radeon r7, 4GB RAM
GeForce GTX 1050 Ti/PCIe/SSE2
I used Gentoo for some years. I still don’t get whether you do
emerge darktable (or whatever the command to install is) or whether you have your own script which runs
./build.sh from the dt source dir.
IIRC people discussed on the mailing list that there are some differences in speed.
Haven’t found any sources related to dt.
Interesting. I have performed the test and:
21.805369 [dev_process_export] pixel pipeline processing took 21.397 secs (23.753 CPU)
[export_job] exported to 'test.jpg'
22.322860 [opencl_summary_statistics] device 'Intel(R) Gen9 HD Graphics NEO' (0): 550 out of 550 events were successful and 0 events lost
21.474766 [dev_process_export] pixel pipeline processing took 21.089 secs (160.353 CPU)
[export_job] exported to 'test_01.jpg'
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (HD Graphics 530) with 32 GB RAM. Didn’t expect this.
@maf Intel’s OpenCL driver is apparently really bad and you should not use it.
Might it be that you have inadvertently destroyed the xmp file?
Try to fetch both files again and re-run the test.