darktable speed (in general, and when using two monitors)

aadm · April 4, 2019, 2:40pm

I have been fixated on the performance of DT since I first started to use it. Things have improved, I have been asking for support here in the forum too, and I was quite satisfied with everything… until the other day, when I opened Lightroom on my old MacBook Pro and was surprised to see how fast it felt when modifying exposure, tone curves, etc.; everything happens in real time and feels very fluid.

Consider this: I am working on the same 24Mp file, and the macbook is a 2014 model, core i5 with 8Gb of ram, internal intel gpu – compared to the much more modern Dell XPS with i7, 16 Gb of ram and Geforce gpu.

Needless to say, I am still grateful that DT exists, I am happy with the overall system, and I will never go back to Mac OS/Lightroom, but there is maybe room for some improvement in terms of usage. It seems to me that most of the discussions here are very technical and few people comment on more generic usage like fast culling of photos, quick edits that get propagated, etc.

For example, I frequently use darktable-generate-cache to make previews so that I can quickly scan through a series of images taken during a day, but that involves importing the images in DT, taking note of the oldest photo-id of the batch, quitting, running the binary from the command line, and opening DT again… if I were to convince a friend to switch over from LR, I would have a hard time explaining that he needs all of these steps just to do a first culling!
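A minimal sketch of the "run the binary from the command line" step, written as a small Python helper. It assumes the `--min-imgid`, `--max-imgid` and `--max-mip` options of `darktable-generate-cache` behave as described in the darktable manual for your version (check `darktable-generate-cache --help`); the function name and the image id `46000` are made up for illustration.

```python
import shlex


def generate_cache_command(min_imgid, max_imgid=None, max_mip=None):
    """Build the argv list for darktable-generate-cache, limited to an image-id range.

    The option names are assumptions taken from the darktable manual;
    verify them against `darktable-generate-cache --help` on your system.
    """
    argv = ["darktable-generate-cache", "--min-imgid", str(min_imgid)]
    if max_imgid is not None:
        argv += ["--max-imgid", str(max_imgid)]
    if max_mip is not None:
        argv += ["--max-mip", str(max_mip)]
    return argv


if __name__ == "__main__":
    # 46000 is a hypothetical id standing in for the oldest photo of the batch.
    print(shlex.join(generate_cache_command(46000)))
```

To actually run it, pass the returned list to `subprocess.run(argv, check=True)` while darktable itself is closed, as in the workflow described above.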

Anyway, first question is this: what’s the performance like for the rest of you; is a simple exposure variation lagging somehow or fluid?

What I’ve learnt to do is this (apart from calculating previews separately, as I’ve just said): I set the simplest demosaicing algorithm (e.g. VNG for Fuji files instead of Markesteijn, or PPG instead of AMaZE for Nikon files), deactivate all denoising modules, and do the actual editing, which is usually a combination of exposure, tone curves, crop, etc. After this, at the very end, I change the demosaic algorithm and activate denoising.

This gives me a little more speed I feel, but what I discovered the other day is that by switching off my second monitor DT feels much faster! I haven’t been able to make an objective comparison but I wonder if also other people have found this to be true.

My standard setup is this: Dell XPS 15 plugged in a thunderbolt docking station Dell TB16, driving the main monitor, a Benq 27" 2560x1440px and an older Nec 23" EA231wmi 1920x1080. I should also point out that usually DT runs full screen on the Benq monitor, and the other one is to the side with maybe a chrome window opened on the online manual or something, so it’s not actually used by DT.

rawfiner · April 4, 2019, 2:51pm

Hi
darktable’s darkroom performance is affected by screen resolution.
Is the resolution of your mac the same as your benq?

To compute a fast preview, darktable downscales the image right after demosaic so that the algorithm can run on a smaller image. If screen resolution is higher, I guess the image used for preview will have more pixels, and preview will take more time
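A rough way to put numbers on this, assuming (as a simplification) that the preview is at most as large as the whole screen. The resolutions are the ones quoted earlier in the thread for the Benq and for a 1920x1080 display; the 24 Mp figure is nominal.

```python
# Upper bound on preview size: the full display resolution.
# In practice the darkroom panels take part of the screen, so the real
# preview is smaller; this is only a back-of-the-envelope comparison.
DISPLAYS = {
    "Benq 2560x1440": (2560, 1440),
    "Full HD 1920x1080": (1920, 1080),
}
RAW_PIXELS = 24_000_000  # nominal 24 Mp raw file


def megapixels(width, height):
    """Pixel count of a width x height area, in megapixels."""
    return width * height / 1e6


def report():
    """One line per display: size in Mp and share of the full raw file."""
    lines = []
    for name, (w, h) in DISPLAYS.items():
        share = w * h / RAW_PIXELS
        lines.append(f"{name}: {megapixels(w, h):.2f} Mp, {share:.0%} of the raw file")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Under this assumption the 2560x1440 preview has about 1.8 times as many pixels to process as the 1920x1080 one, while both are a small fraction of the full 24 Mp image.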

paperdigits · April 4, 2019, 3:02pm

If by disconnecting the second monitor you’ve also closed chrome, then you have way more free ram to use

aadm · April 4, 2019, 10:42pm

@rawfiner Uhm well the macbook has a retina/hidpi display (2560x1600px) while my Dell only has to drive a HD display 1920x1080px (internal) or the Benq which is again 2560x1440px. So the mac drives a similarly-sized or even bigger display (in terms of pixels) with a much more modest cpu+gpu combo, and still is way faster than DT.

I have just tried to use DT on just my internal laptop display and it still is laggy and not fluid when varying exposure.

@paperdigits alright, I am closing Chrome too but still no marked differences.

paperdigits · April 5, 2019, 3:34am

I don’t remember @aadm, are you using the proprietary nVidia driver?

aadm · April 5, 2019, 5:59am

Yes, nvidia-390.

aadm · April 5, 2019, 6:43am

I will also add the underlying reason for this criticism.

DT development seems to me pointing towards adding lots of features to what is an already impressive piece of software as it is, perhaps at the expense of consolidating the existing functionalities and making the whole experience a bit smoother and faster.

I like what is already available in DT, and I can see myself instructing friends that are still in the Mac/Adobe world how to replicate their Lightroom workflow with Darktable. But they would have to be much more patient, because the overall experience will not be as smooth.

Is it my impression that developers neglect this particular aspect, the user experience/interaction, to the advantage of adding more features?

Features like the latest “culling” view in Lighttable that maybe provides an alternative way of editing (in the sense of selecting) photos (but honestly I was already happy with the standard file-manager mode).

Or darkroom modules that can be very complex to use, sometimes providing overlapping functionalities with others (incidentally, I wish there could be a way to make a survey on all DT users and get statistics on each module’s usage to decide what are the most used ones. In this way perhaps the developer could focus their attention on making these modules better/faster).

I’ll go back to my brief experience opening Lightroom again after months; on a much slower computer LR seemed so much more refined and quicker in very simple actions (stuff like scrolling through a series of photos; adding metadata, stars, etc.; quickly developing a photo by modifying exposure and tone curves, with changes visualized in real time).

As I said many times already, I have started to get accustomed to “my” workflow in Darktable to do exactly the same things as above; adding metadata and making an edit is a bit more cumbersome because for example I would need to generate previews externally with darktable-generate-cache, and once I tag or add stars to a series of images I know that it is not super-responsive if I have say 30 photos selected. In darkroom, I can play with simple tuning of basic modules even if I have to be patient.

I mean, I can now do everything I used to do in LR but believe me the experience is not as fluid or fast as it could (or should) be, considering the type of machine I’m working on.

Is this perhaps a reflection of the typical user base of DT? Perhaps more hobbyists than “professionals”? More nerdy photographers, who enjoy the technical aspect of getting the absolute best out of a single image, than real-life photographers who build coherent edits without extensive processing?

I hope that the developers and other enthusiasts giving their free time to DT will not take offence at my words; I say these things because I love DT and I just wish it could be even better than what already is. Again, I can see the struggle I would have to convince friends to switch over from LR if I’m not able to show them a basic experience on par with LR; most of them will not care one bit if the denoising algorithm is better or if I can do crazy things in processing; they will just see DT as a slower, more cumbersome way to do photo edits.

pk5dark · April 5, 2019, 8:49am

hmm, I’m using a quite old PC with i7-2600k and a Nvidia GTX1060 with 4 GB of RAM and dt doesn’t feel slow here.

But I guess it is because of the relatively low screen resolution with 1920x1200?

I used darktable-generate-cache a long time ago and don’t understand what the benefit should be.
I guess it would be interesting what really causes the slowness in your case.

aadm · April 5, 2019, 10:47am

Well that’s interesting to know! After many months of intensive usage I cannot say that my computer is overall “slow”, not at all.

Also, please keep in mind that my experience doesn’t change much between using the laptop display (1920x1080) and the external monitor (Benq 2560x1440px).

darktable-generate-cache simply computes the full size previews after you import a number of photos; so that I can go full screen in lighttable mode and flick from one to the other without any lag. Do you do the same? Does the preview generation feel instant? Or maybe you are using the jpeg previews attached to the raw file?..

Another detail: I do have a large library (~46k photos, all on external USB3 drive) but the experience I’m describing is relative to a small number of photos (~1000) that I have made local copies onto my (fast) internal ssd.

pphoto · April 5, 2019, 11:40am

I really like the fact having multiple ways to solve a problem than having just one slider. The best method varies from case to case and with its various modules Darktable gives me this choice.

On the other hand there are long processing times slowing me down. My computer is a AMD A8-7600 with 4GB RAM and a GeForce GTX 1050 Ti GPU.

The amount of processing times mostly depend on the number of modules I use. This is my performance (OpenCL enabled) opening a photo with a simple edit of a test shot I made with a new lens:

And this is the performance of opening a portrait photo with heavy editing that took me several hours to complete (history is compressed):

For example modifying a parameter of the ‘defringe’ module with a preview in ‘fit in window’ size results in

4697,973936 [dev_process_image] pixel pipeline processing took 8,868 secs (21,064 CPU)

In a 100% view where I can see the changes’ results:

4840,924090 [dev_process_image] pixel pipeline processing took 12,053 secs (29,645 CPU)

Of course I could say ‘buy a better PC for better performance’ but this would not remove the underlying problem.

I do a second test with changing a parameter of the ‘soften’ module with the result:

192,391544 [dev_process_image] pixel pipeline processing took 4,891 secs (9,364 CPU)
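For anyone who wants to collect such timings from a log rather than copy them by hand, here is a minimal parsing sketch. It assumes the lines have exactly the shape shown above (presumably darktable's performance debug output) and that decimals use a comma or a dot depending on locale; it does not handle the per-module lines, whose format is not shown in this thread.

```python
import re

# Matches lines such as:
# 4697,973936 [dev_process_image] pixel pipeline processing took 8,868 secs (21,064 CPU)
LINE_RE = re.compile(
    r"^(?P<stamp>\d+[.,]\d+)\s+\[(?P<tag>[^\]]+)\]\s+(?P<what>.*?)\s+took\s+"
    r"(?P<wall>\d+[.,]\d+)\s+secs\s+\((?P<cpu>\d+[.,]\d+)\s+CPU\)"
)


def to_float(text):
    """Convert '8,868' or '8.868' to a float."""
    return float(text.replace(",", "."))


def parse_perf_line(line):
    """Return a dict with tag, description, wall and CPU seconds, or None if no match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "tag": m["tag"],
        "what": m["what"],
        "wall": to_float(m["wall"]),
        "cpu": to_float(m["cpu"]),
    }


if __name__ == "__main__":
    sample = [
        "4697,973936 [dev_process_image] pixel pipeline processing took 8,868 secs (21,064 CPU)",
        "4840,924090 [dev_process_image] pixel pipeline processing took 12,053 secs (29,645 CPU)",
        "192,391544 [dev_process_image] pixel pipeline processing took 4,891 secs (9,364 CPU)",
    ]
    for entry in map(parse_perf_line, sample):
        print(entry["what"], entry["wall"], entry["cpu"])
```

The CPU figure being larger than the wall-clock figure simply reflects several cores working at once.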

Exporting the detailed times to Libreoffice’s Calc and adding the seconds from CPU and GPU usage I see:

74 calculations for the GPU in 2,548 seconds
10 calculations for the CPU in 2,279 seconds

The steps using CPU are

highlight reconstruction
demosaic
spot removal
spot removal 1
haze removal
defringe
output color profile
watermark
dithering
gamma

Without OpenCL the change takes 7,005 seconds.
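The figures from the ‘soften’ test can be summarized with a little arithmetic. This sketch only re-uses the numbers quoted above and assumes the 7,005 seconds without OpenCL refers to the same change.

```python
# Numbers quoted in the post for the 'soften' test (seconds).
gpu_steps, gpu_secs = 74, 2.548
cpu_steps, cpu_secs = 10, 2.279
total_with_opencl = 4.891
total_without_opencl = 7.005  # assumed to be the same edit, OpenCL disabled


def summarize():
    """Derive per-step averages, the CPU share, and the OpenCL speedup."""
    summed = gpu_secs + cpu_secs
    return {
        "summed_secs": summed,
        "cpu_share": cpu_secs / summed,
        "avg_gpu_step": gpu_secs / gpu_steps,
        "avg_cpu_step": cpu_secs / cpu_steps,
        "opencl_speedup": total_without_opencl / total_with_opencl,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value:.3f}")
```

In other words, the ten CPU-only steps account for roughly 47% of the summed time even though they are only 10 of 84 steps, and enabling OpenCL makes this particular change about 1.4 times faster overall.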

I would be much happier with better performance than with new modules.
In my opinion future versions should focus on:

Enabling GPU usage for all modules
Performance optimizations

pk5dark · April 5, 2019, 12:15pm

I do not use the embedded jpeg of the raw. After import I do most of the image selection in darkroom mode before further tweaking. Even though I have my images on the internal rotating drive, it is fast enough for me to switch between images. It is simply the power of the nvidia GTX 1060 which makes this possible. Without OpenCL it would be too slow, of course.

These full size previews are quite new in dt. Switching around in darkroom mode is fast enough, so I didn’t look into that.

It is quite interesting how different the user experience can be.

paperdigits · April 5, 2019, 4:16pm

I don’t think that the two are exclusionary, e.g. features vs performance. @anon41087856 is working on the GUI currently, while also smoothing out his new filmic module as well.

Perhaps this is part of the problem with open source development, with not having a huge developer base, and with not having paid development/not taking donations.

New features, from a development point of view, are fun and exciting, and get lots of positive feedback and wows from users. Polishing the UX/UI, performance gains, and the 10-20% polish that takes an application from great to outstanding is not glorious and takes a lot of debugging time.

Bringing the wants to the front of the conversation is a good first step. But where do we go from here? Here are some suggestions:

bug reports that are hyper-focused. If you can provide some debug information, that’d be awesome.
recruit more developers to work on darktable, particularly ones with UX/UI experience
find some way to compensate people for their work in the direction you want darktable to go
keep engaging with developers in a positive way

anon41087856 · April 6, 2019, 12:07pm

You have no idea what you are asking.

GPU codepaths use OpenCL, an intrinsically vectorized language/lib. Translating C algorithms to OpenCL is not trivial, and requires a serious set of abstraction skills. Besides, not every algorithm is a good candidate for GPU off-loading; for example, those with few computations and large memory accesses (tone curve, for example) show virtually no improvement in OpenCL.
“Optimization” is an empty word. We can only optimize for a specific (known) hardware, at the expense of maybe degrading the perf for other types of hardware or losing compatibility. For example, in OpenCL, a huge penalty comes from moving the data buffers back and forth between the RAM and the vRAM, especially for low-end GPUs. We could overcome that by using different cache management strategies (tiling, etc.), but should we optimize for high-end or low-end targets? (Because it’s not always just a matter of setting a tile size in the user’s prefs; sometimes the code should be adapted.)
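The transfer penalty can be illustrated with a toy cost model. Everything numeric here is a hypothetical assumption (a 4-channel float32 buffer, a bus bandwidth of 4 GB/s, the per-module timings); it is not how darktable decides anything, only a sketch of why a cheap, memory-bound module gains nothing from the GPU while a heavy one does.

```python
def buffer_bytes(megapixels, channels=4, bytes_per_value=4):
    """Size of one full-resolution buffer; 4 float32 channels is an assumption."""
    return int(megapixels * 1e6) * channels * bytes_per_value


def gpu_pays_off(cpu_secs, gpu_secs, nbytes, bus_bytes_per_sec):
    """True if GPU compute plus an upload and a download beats the CPU time."""
    transfer_secs = 2 * nbytes / bus_bytes_per_sec  # RAM -> vRAM and back
    return gpu_secs + transfer_secs < cpu_secs


if __name__ == "__main__":
    nbytes = buffer_bytes(24)   # 24 Mp image
    bus = 4e9                   # hypothetical effective bandwidth, bytes/s

    # Hypothetical cheap, memory-bound module (tone-curve-like).
    print(gpu_pays_off(cpu_secs=0.05, gpu_secs=0.01, nbytes=nbytes, bus_bytes_per_sec=bus))
    # Hypothetical heavy, compute-bound module (denoise-like).
    print(gpu_pays_off(cpu_secs=2.0, gpu_secs=0.2, nbytes=nbytes, bus_bytes_per_sec=bus))
```

A real pipeline may keep buffers on the device between consecutive GPU modules, which this model ignores; that is exactly the kind of cache-management choice that depends on the target hardware.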

You can get a “free” performance boost by compiling yourself from the sources, so the compiler will be able to unleash specific optimizations for your particular CPU. Packaged versions of dt use vanilla “one-size-fits-all” optimizations to satisfy all users.

But there is no “new features vs. perf” priority in dt. Most of the new features are contributions from devs who have no background in high-performance computation. I have done several campaigns of code optimization (see Filmic perfs…), but I develop with a Xeon and people with Core i5-7 don’t always reproduce my speed-ups. What am I supposed to do?

TL;DR we do what we can. Before becoming a dt’s dev, I was a heavy retoucher too, these things matter to me too, but optimizing is tricky, sometimes backfires, and optimizing someone else’s code needs time and understanding of what it does, taking the risk of breaking the feature in subtle ways.

We need devs who understand at the same time color management, image processing, maths (because lots of optimizations can be obtained by just factorizing or reordering computations), low level memory/cache use, parallelization and vectorization. These people usually already have well-paid jobs, social security, and little spare time.
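As an illustration of what “factorizing or reordering computations” can mean, here is a minimal sketch in plain Python. The exposure-plus-gain example and all names are hypothetical and not taken from darktable’s code: two linear steps are folded into a single multiply-add whose constants are computed once, outside the loop.

```python
import math


def naive(pixels, ev, black, gain):
    """Two passes; 2**ev is recomputed for every pixel."""
    exposed = [(p - black) * 2.0 ** ev for p in pixels]
    return [e * gain for e in exposed]


def factorized(pixels, ev, black, gain):
    """One pass: (p - black) * 2**ev * gain rewritten as p * scale + offset."""
    scale = (2.0 ** ev) * gain
    offset = -black * scale
    return [p * scale + offset for p in pixels]


if __name__ == "__main__":
    data = [0.1, 0.5, 0.9]
    a = naive(data, ev=1.0, black=0.02, gain=1.5)
    b = factorized(data, ev=1.0, black=0.02, gain=1.5)
    # Equal up to floating-point rounding, not necessarily bit-identical.
    print(all(math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-12) for x, y in zip(a, b)))
```

The two versions agree only up to rounding, which is a small example of how an optimization can change results in subtle ways.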

dt development is fully anarchic. Contributors propose code that eventually gets merged by core devs (aka Pascal Obry, these days). There is no management, no long-term project, no general goal. So, as long as contributors propose new features, new features will be merged.

I have been working full-time on that for the past 5 months, and almost full-time for the 4 previous months. The skill set required to only begin to understand what’s going on under the hood is worth a master’s degree in itself.

pphoto · April 6, 2019, 3:06pm

Thank you very much for clarifying this and for the insights into the project.

I see that in these circumstances it is not realistic to ‘just shift some developer’s work into performance optimization’.

That leaves me with the fact that processing time increases with the amount of megapixels in the image, the screen resolution and the amount of edits I make.

So what can I do?

compiling DT myself: never compiled anything before, but I will do some research and evaluate if I can do that by myself
Experimenting with OpenCL optimization. I will set up some test cases with typical edits from myself and will run through them with various OpenCL configurations.
https://www.darktable.org/usermanual/en/darktable_and_opencl_optimization.html
Get better hardware. Since more complexity results in higher hardware requirements I evaluate various upgrade options for my CPU and GPU.
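Before experimenting with OpenCL settings, it helps to record what is currently configured. A minimal sketch, assuming the settings live as `key=value` lines in `~/.config/darktable/darktablerc` (the usual location on Linux) and that the relevant keys start with `opencl`; the key names in the sample are from memory of the manual page linked above and should be treated as assumptions.

```python
from pathlib import Path


def opencl_settings(text):
    """Return the key=value pairs whose key starts with 'opencl'."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().startswith("opencl"):
            settings[key.strip()] = value.strip()
    return settings


def read_darktablerc(path=Path.home() / ".config" / "darktable" / "darktablerc"):
    """Read the OpenCL-related settings from a darktablerc file."""
    return opencl_settings(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Sample content; key names are assumptions, values are made up.
    sample = "opencl=TRUE\nopencl_async_pixelpipe=false\nopencl_micro_nap=1000\nui_last/view=1\n"
    for key, value in opencl_settings(sample).items():
        print(f"{key} = {value}")
```

Saving this output next to each timing run makes it easier to tell which configuration produced which result. Edit darktablerc only while darktable is closed, since the program writes the file on exit.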

anon41087856 · April 6, 2019, 3:18pm

This is always going to be the case, no matter how much you optimize or how much you pay for your hardware, especially since better algorithms tend to be heavier (demosaicing, denoising, sharpening, highlights reconstruction, etc.). There is no magic here.

darix · April 6, 2019, 3:52pm

for openSUSE and IIRC Fedora we have packages with optimizations for some CPUs.

AxelG · April 6, 2019, 5:58pm

@pphoto
As you can see from the darktable manual, it even supports two different OpenCL devices.

My experience: I bought myself a second GTX1060 G1 and the speedup is amazing.

My feeling is that with just one GPU, even when it is not fully loaded, the next bigger task goes to the CPU; with the second card that changed. My machine is a beast now.

So that is a nice and not so expensive way. And don’t worry about PCIe x16 or x8. It depends on the generation, so just check the real transfer rates… x8 is enough, usually…

heckflosse · April 6, 2019, 10:53pm

A must read for C(++) coders concerning optimizations

darix · April 6, 2019, 10:58pm

The real challenge is how to optimize a binary and still make it portable for distribution (be it linux distros or windows/mac installer).

heckflosse · April 6, 2019, 11:05pm

I disagree.

The challenge is (ordered by priority from dev perspective)

optimize the code to perform well on a single core scalar arch
vectorize the code
parallelize the code
let the builders do the remaining parts