processing that sucks less?

are there some requirements on the graphics card? I managed to compile this on a ThinkPad t460 with an embedded intel GPU, but it gets stuck immediately after opening the main window, with no image displayed and everything becoming unresponsive… (which is quite a pity, since it looks very interesting)

what’s the gpu in this? i developed most of this on a 5yo intel hd5500. you can try ‘-d all’ and see if it outputs useful diagnostics. also double check whether the cfg file you passed to -g has a filename parameter that points to a valid file.

oh , another caveat is that the cli needs different cfg files than the gui. one depends on an export module, the other on a display module (to be improved…). the error handling would probably mostly be non existent at this point, sorry.

thanks for the quick reply. I tried a couple of config files, (gui.cfg and raw.cfg iirc), in both cases making sure I set a valid path name. I’ll run more diagnostics once I get back home next week – I just wanted to check that there were no obvious reasons for the misbehaviour

Is this about darktable not slowing down when many instances are active? If yes, when will it be usable?

No, this is an entirely new experimental pure-GPU pipeline for extreme processing speed.

I do understand that it is actually a different/new program (“complete rewrite”), but isn’t it supposed to do something similar to old darktable eventually, just faster? or are there no plans for it at all?
hope this does not raise too high expectations

It is about a number of things; e.g., efficiency, speed, taking advantage of the GPU and removing the burden of dependencies. Less about features and more about programming a lightweight and efficient raw and photo processor.

yes, all that. and maybe to add to it: whatever you expect from it at this point you probably have your expectations set too high. :slight_smile:

fwiw i think i fixed my performance counters. quick heads up: nvidia GPU takes like 3ms to demosaic a 30MPix x-trans image, intel laptop hd5500 230ms.

as a comparison: current darktable/git discards this intel GPU as completely useless and instead takes a whole second to demosaic the image for export using four threads of the i5 CPU on the same laptop. interactive/region of interest processing (probably close to 2MPix here) takes around 400ms on the same setup (darktable git, intel i5 laptop).
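To put the timings quoted above side by side, here is a small back-of-the-envelope calculation. It only uses the approximate figures from the posts (3 ms, 230 ms, about one second, 30 MPix); the function and variable names are made up for illustration.

```python
# Approximate timings quoted in the thread for demosaicing one
# 30 MPix x-trans image. The labels are just descriptive strings.
MEGAPIXELS = 30.0
timings_ms = {
    "nvidia GPU": 3.0,
    "intel hd5500": 230.0,
    "i5 CPU, four threads (darktable git)": 1000.0,
}


def throughput_mpix_per_s(megapixels, milliseconds):
    """Megapixels processed per second for a given wall-clock time."""
    return megapixels / (milliseconds / 1000.0)


def speedup(slow_ms, fast_ms):
    """How many times faster the fast path is than the slow one."""
    return slow_ms / fast_ms


for name, ms in timings_ms.items():
    print(f"{name}: {throughput_mpix_per_s(MEGAPIXELS, ms):.0f} MPix/s")

print(f"hd5500 vs CPU: {speedup(1000.0, 230.0):.1f}x")
print(f"nvidia vs hd5500: {speedup(230.0, 3.0):.1f}x")
```

With these rough numbers the integrated GPU comes out a bit over 4x faster than the four CPU threads, and the nvidia card is faster than the integrated GPU by a further factor of about 77.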


Maybe a little OT
I have a lenovo T450s i7 with HD5500.

With the Intel-neo driver and opencl activated (darktablerc tweaking needed) I gain ~11%. Good enough for me, to keep it switched on.

I do miss the power of my PC’s two GTX1060s, though :smile:

Hm. I am not really good at calculating.
I also have an Intel i5 laptop. In darktable 2.6, interactive processing takes a few seconds (1-4) if 10-15 instances with some masks are active. Screen is fullhd and the preview covers 50-75% of it. (That’s normal, isn’t it?)
So if darktable 2.6 had this new “engine”, could I expect it to need maybe 1 second (or less) for the same task, under the same circumstances?
Why is it even that the GPU is faster than the CPU?

GPU shader hardware is inherently designed primarily for performing image manipulation. The original purpose was realtime texture manipulation when rendering 3D games. In general, the GPU shader hardware is usually a massively parallel pile of SPMD (single program multiple data) processing cores.

This architecture makes it excellent for the majority of image processing operations.
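As a rough illustration of the "single program, multiple data" idea, the sketch below applies one small function to every pixel independently. It is plain Python running sequentially, not GPU code; `exposure_kernel` and `run_spmd` are hypothetical names chosen for this example.

```python
def exposure_kernel(pixel, gain):
    """The 'single program': the same code runs for every pixel."""
    return min(pixel * gain, 1.0)


def run_spmd(kernel, image, *args):
    """Apply the kernel to each pixel of a flat list of values.

    Each output depends only on its own input pixel, so the iterations
    could run in any order or all at once. A GPU exploits exactly this
    independence; here we simply loop.
    """
    return [kernel(p, *args) for p in image]


image = [0.1, 0.25, 0.8]
print(run_spmd(exposure_kernel, image, 2.0))
```

Because no pixel needs the result of another, the work splits cleanly across as many cores as the hardware offers.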

darktable achieves this to some degree with OpenCL - but since not everything has an OCL implementation, there are, depending on exact workflow, penalties that arise from GPU<->CPU data transfers. The end result is that depending on your workflow, a weak GPU can potentially be slower. For example, on an i5-7200U (Kaby Lake), the integrated GPU benchmarks as 20-25% slower than a pure-CPU pipeline using darktable’s basic “is this GPU worth using” benchmark. However for a compute-intensive flow with very few GPU<->CPU copies (such as exposure fusion in the basecurve module), that iGPU winds up 4x faster than CPU.

I believe that in the case of something like a GTX1050, this becomes even more important, because crossing the CPU<->GPU boundary may be more of a memory transfer bottleneck - on the other hand once you’re ON the GPU, if you stay there, it’s MUCH MUCH faster.

What @hanatos is working on here is to basically avoid the CPU at all costs, eliminating the copy/transfer penalties.
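The transfer penalty described above can be sketched with a toy cost model. This is only an illustration: the per-module timings and the transfer cost are invented numbers, not measurements, and `pipeline_time_ms` is a hypothetical helper, not anything from darktable.

```python
def pipeline_time_ms(modules, cpu_ms, gpu_ms, transfer_ms):
    """Estimate the runtime of a pipeline in a toy model.

    modules: list of booleans, True if that module has a GPU
    implementation. The image starts and ends in CPU memory, and every
    move between CPU and GPU memory costs one transfer.
    """
    total = 0.0
    on_gpu = False  # the image starts in host memory
    for has_gpu_impl in modules:
        if has_gpu_impl != on_gpu:
            total += transfer_ms  # crossing the CPU<->GPU boundary
            on_gpu = has_gpu_impl
        total += gpu_ms if has_gpu_impl else cpu_ms
    if on_gpu:
        total += transfer_ms  # copy the final result back to the host
    return total


# Invented numbers: a weak GPU that is only somewhat faster per module.
CPU_MS, GPU_MS, TRANSFER_MS = 10.0, 6.0, 8.0

all_cpu = pipeline_time_ms([False] * 6, CPU_MS, GPU_MS, TRANSFER_MS)
mixed = pipeline_time_ms([True, False] * 3, CPU_MS, GPU_MS, TRANSFER_MS)
all_gpu = pipeline_time_ms([True] * 6, CPU_MS, GPU_MS, TRANSFER_MS)
print(all_cpu, mixed, all_gpu)
```

In this made-up setup the mixed pipeline is the slowest of the three, because it pays for a transfer at every module boundary, while the all-GPU pipeline pays for only two transfers and ends up ahead of the CPU.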


In simple terms, a GPU is for specific computations and it is good and fast at those specific computations, such as image processing and 3d rendering.

A CPU is designed for general workloads, such as running your operating system or working on a spreadsheet. A CPU sacrifices speed at specific tasks to be generally acceptable at many tasks.

A GPU wouldn’t be able to run Linux very well, but a CPU does.


I actually have great expectations from your code, at least in the longer term. For me, this would be the natural evolution of my PhotoFlow project…

My idea for a starting point would be to build a simple Qt-based UI on top of your GPU code, to provide the tools needed to interactively build and re-arrange the pipeline. Something similar to what already is done in photoflow, but re-engineered and possibly gathering advice and design guidelines from a wider community of potential users (when I started coding PhotoFlow I was alone at my desk, and with only my own requirements in mind…).

What do you think?


Would rearranging the pipeline necessitate a shader recompilation?

Good question! Eliminating the CPU/GPU interaction that allows OpenCL to reorder things like this without much difficulty probably has a negative impact on throughput vs. being able to kick off the GPU to chew on an entire pipeline flow at once.

On the other hand, while that might get the entire pipeline execution time down, it might result in negative consequences for a program where UI interactivity is desired - by requiring a recompile every time you adjust something.

Not sure. Interesting observation - ffmpeg will recompile any OCL kernels it uses at runtime, and this is one way they get rid of conditionals in their performance-critical code. Most things that might require conditionals are settings, so they just generate a header file with the settings in question and compile it along with the main kernel source at runtime. Great approach if you’re feeding thousands of frames through an algorithm with the exact same settings, not necessarily the most optimal in a still image editing flow where UI interactivity might be required.
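The "settings as a generated header" idea described above can be sketched roughly as follows. This only builds the source string that would be handed to a runtime compiler; it does not call ffmpeg or any OpenCL API, and the kernel, the setting names and both helper functions are hypothetical.

```python
# A made-up OpenCL C kernel whose behaviour depends on preprocessor
# symbols instead of runtime conditionals.
KERNEL_SOURCE = """\
__kernel void tonemap(__global float *px)
{
    size_t i = get_global_id(0);
#if USE_GAMMA
    px[i] = pow(px[i] * EXPOSURE, 1.0f / GAMMA);
#else
    px[i] = px[i] * EXPOSURE;
#endif
}
"""


def settings_header(settings):
    """Turn a dict of settings into #define lines, sorted by name."""
    lines = []
    for name, value in sorted(settings.items()):
        if isinstance(value, bool):
            lines.append(f"#define {name} {int(value)}")
        elif isinstance(value, float):
            lines.append(f"#define {name} {value!r}f")
        else:
            lines.append(f"#define {name} {value}")
    return "\n".join(lines) + "\n"


def build_source(settings):
    """Prepend the generated header to the kernel source."""
    return settings_header(settings) + KERNEL_SOURCE


print(build_source({"USE_GAMMA": True, "GAMMA": 2.2, "EXPOSURE": 1.5}))
```

Every changed setting yields a different source string and therefore a fresh compile, which is cheap when amortised over thousands of video frames and much less attractive when a slider is being dragged.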

This is not OpenCL but Vulkan GLSL.

yes, in the long term i agree here. not sure about qt. i think it’s bloatware. i’d much rather pimp the ui elements of imgui, so i’m sure it’s lean and fast. creating a vulkan viewport in qt is probably possible but i don’t know how much overhead it introduces otherwise. certainly comes with a bag full of dependencies that break compilation every so often and will cause code rot.

and yes, i think the discussions about what’s the best linear/tone mapping pipeline in the other threads are super relevant to stabilise the pipeline/processing graph before trying to put this into an end-user product here. another critical point is interleaving multiple pipelines, or processing multiple frames through the same pipeline asynchronously (for thumbnails with default processing when opening a folder for the first time). tons of features missing. and the code is currently still changing substantially. not anywhere near stable enough to build a library from it.



so there’s multiple levels of updating the pipeline (see dt_graph_run() and the runflags it receives as argument).

if you play with shader code in your text editor, currently you can press ‘r’ to issue a pipeline recompilation, to interactively see your code changes affect the image. this is updating with a sledgehammer.

if you don’t change the code of your modules, you can easily change the order by using dt_module_connect() (and i want a python or lua or similar interface here). in fact, it’s often times a good idea to separate your problem into multiple individual dispatches (see the graph for the local laplacian above). every one dispatch will run a precompiled shader and consume the input of the previous node. so reordering will not recompile, but it will re-traverse the graph and re-allocate memory buffers. once you keep the order fixed, the memory allocation remains static and no malloc/free equivalents need to be run, the graph remembers.
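A toy model of these update levels might look like the sketch below. It is not the actual API: `ToyGraph` and its methods are hypothetical, the module names are invented, and the signatures of dt_graph_run() and dt_module_connect() are deliberately not reproduced. It also assumes that the full recompile throws away the buffer layout as well, which the post does not state.

```python
class ToyGraph:
    """Toy node graph that remembers compiled shaders and buffers."""

    def __init__(self, order):
        self.order = list(order)    # module names in processing order
        self.compiled = set()       # modules whose shader is built
        self.buffers_valid = False  # True once buffers match the order

    def recompile_all(self):
        """Sledgehammer update: forget all cached state (assumption)."""
        self.compiled.clear()
        self.buffers_valid = False

    def reconnect(self, new_order):
        """Reorder modules; precompiled shaders stay valid."""
        self.order = list(new_order)
        self.buffers_valid = False  # the graph must be re-traversed

    def run(self):
        """Do only the work that is actually needed and report it."""
        work = []
        for module in self.order:
            if module not in self.compiled:
                self.compiled.add(module)
                work.append(f"compile {module}")
        if not self.buffers_valid:
            self.buffers_valid = True
            work.append("allocate buffers")
        work.append("dispatch")
        return work


graph = ToyGraph(["demosaic", "exposure", "curve"])
print(graph.run())   # first run: compile everything, allocate, dispatch
print(graph.run())   # nothing changed: dispatch only
graph.reconnect(["demosaic", "curve", "exposure"])
print(graph.run())   # reordered: re-allocate, but no recompilation
```

The point of the sketch is only the ordering of costs: a fixed graph does no setup work at all, a reorder redoes the allocation, and a code change redoes everything.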

the least intrusive way of interacting are parameters. you can change these (sliders for instance, if you wired a gui), and they will be copied to uniform buffers or push constants before dispatch.
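As a rough picture of why parameter updates are so cheap: they boil down to copying a handful of bytes. The sketch below packs three floats into a 12-byte block of the kind one might hand to a uniform buffer or push constant. The parameter names and the layout are invented for illustration, and no Vulkan call is made.

```python
import struct

# Hypothetical parameter block for one module: three little-endian
# 32-bit floats, in this fixed order.
PARAM_LAYOUT = struct.Struct("<3f")
PARAM_NAMES = ("exposure", "black", "gamma")


def pack_params(params):
    """Serialise the parameter dict into the raw bytes to upload."""
    return PARAM_LAYOUT.pack(*(params[name] for name in PARAM_NAMES))


def set_param(params, name, value):
    """Return updated parameters plus the new byte block.

    Nothing is recompiled or re-allocated here; only the bytes change.
    """
    if name not in PARAM_NAMES:
        raise KeyError(name)
    updated = dict(params)
    updated[name] = float(value)
    return updated, pack_params(updated)


params = {"exposure": 1.0, "black": 0.0, "gamma": 2.0}
params, blob = set_param(params, "exposure", 0.5)
print(len(blob), PARAM_LAYOUT.unpack(blob))
```

A slider callback in a GUI would do little more than this before the next dispatch.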

For me, the main requirement would be to have a way to represent the pipeline through a tree-like UI structure, and be able to re-shuffle the elements via drag-n-drop. I have to dig into the Imgui documentation and examples to see if this is possible. If yes, I have no objection to going in this direction. I actually like the polished and minimalistic look…

cool! i have good trust in the imgui code base. people do crazy things with it:

plus, the code is really easy to go through and assumes next to no dependencies or complicated object oriented bloat structures you have to deal with. if necessary it’s very straight forward to put your own special stuff into it.
