After getting my graphics card up and running [1] and letting my performance tester [2] run I thought I had a bad dream. But another run just confirmed the results:
With the GPU, darktable was now slower, running at about 70% of the CPU-only performance.
My machine is nothing cutting-edge – the i7-6700 was introduced in 2015 – and I know that the GT1030 is, by today's measures, really at the low end. But shouldn't there be at least a little improvement?
Apparently darktable hands all the images to the GPU upon export. And yes, the system monitor agrees: the CPU is doing (almost) nothing while the GPU is working. I have looked at the various performance settings, but it seems CPU and GPU only work side by side in some parts of the GUI.
I knew that the card was no miracle worker, and part of buying it was to test out the low end of GPUs. I also have some other tools I want to try that require OpenCL, where performance is no issue at all. So I am neither angry nor sad, just a little astonished. It is a test, and a fully deductible one, too.
But what did I miss in the manual or settings that would allow me to utilize the full rendering power of my machine, especially when using darktable-cli or darktable-generate-cache?
The scheduling profile is the default, since there is only one GPU and I'd like to utilize the CPU, too. I've tried all the OpenCL performance tuning settings; no difference.
Other than that, darktable has “unrestricted” resources and I use the full disk backend for thumbnails and cache.
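For reference, this is roughly how I check what the pipeline actually does during an export – the -d opencl and -d perf debug flags are documented darktable options that print OpenCL activity and per-module timings; the file names below are just placeholders:

    # export a single test image while logging OpenCL usage and module timings
    darktable-cli IMG_0001.NEF test.jpg --core -d opencl -d perf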
The 1030 is much (in some tests, 3 times) slower than my 1060 (which, depending on the processing stack, is several times as fast as my Ryzen 5 5600 CPU). I assume the 1030 also does not have much RAM, which may then result in tiling, potentially leading to disastrous processing times.
You could try one of the other resource settings… unrestricted might actually not be the best. I think AP did an optimization walk-through video back in the day with a resource monitor running, where he demonstrated tweaking things with respect to system resources… likely still relevant.
I think @kofa posted a nice thread some time ago with all his optimization experiments… I think in his case he found a substantial improvement by tweaking the memory headroom and one or two other minor changes… likely he can correct me or direct you to that…
Which is pretty much as expected, yes.
Although, even with 36-megapixel files, I never hit the ceiling of those 2 GB, if the output of “nvidia-smi -l 1” can be trusted.
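For completeness, a slightly more targeted variant of that command, using standard nvidia-smi query options to print just memory use and GPU load once per second:

    # one CSV line per second: used VRAM, total VRAM, GPU utilisation
    nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv -l 1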
And I do see a tangible speedup in the GUI when interacting with images.
That still leaves the main question open: why is the CPU (almost) idle during export?
Darktable won’t process in parallel when you export. Each module runs either on the GPU or on the CPU. Also, images are not exported in parallel: it’s one image after the other, with one module after the other for each image.
Depends on the stack, and how much memory you tell darktable to set aside. Problem exporting - #31 by kofa – it’s a 24 MB image, and used 4.8 GB of the 6 GB on my card.
Which gives me an idea. If one darktable instance can only use either the CPU or the GPU, well then I'll have to run two darktables and split the images between them.
A little scripting and a VirtualBox VM should do the trick.
A nice project for when I next have some free time available.
I think it can use both if configured in the scheduling settings…
I think you normally try to keep the GPU busy, and I believe @kofa bumps the delay before the pipe switches to the CPU so that it doesn't jump prematurely, but you can specify whether the CPU is allowed as a fallback and for which previews/pipelines… Below is taken from the manual:
"If a pixelpipe process is about to be started and all GPUs in the corresponding group are busy, darktable automatically processes the image on the CPU by default. You can enforce GPU processing by prefixing the list of allowed GPUs with a plus sign +. In this case darktable will not use the CPU but rather suspend processing until the next permitted OpenCL device is available.
darktable’s default setting for “opencl_device_priority” is */!0,*/*/*.
Any detected OpenCL device is permitted to process the center view image. The first OpenCL device (0) is not permitted to process the preview pixelpipe. As a consequence, if there is only one GPU available on your system, the preview pixelpipe will always be processed on the CPU, keeping your single GPU exclusively for the more demanding center image view. This is a reasonable setting for most systems. No such restrictions apply to export and thumbnail pixelpipes.
The default is a good choice if you have only one device. If you have several devices it forms a reasonable starting point. However, as your devices might have quite different levels of processing power, it makes sense to invest some time optimizing your priority list."
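As a purely hypothetical illustration of that syntax (not a recommendation), keeping the preview on the CPU but forcing the export pipeline onto the single GPU could look like this in darktablerc:

    # center view: any device / preview: CPU only (device 0 excluded) /
    # export: device 0 enforced (the + prefix disables the CPU fallback) / thumbnails: any device
    opencl_device_priority=*/!0,*/+0/*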
Yes, for each module, the image can be divided into several parts, and processing can be done in parallel (just like how a GPU would process hundreds of parts in parallel). The 1030 has 384:24:16 cores (I think it's the first number, the ‘shader’ count, that matters); my 1060 has 1280:80:48; a 4060 has 3072:96:32; a top-of-the-range 4090 has 16384:512:176.
VirtualBox is bad for performance. You could run two darktable instances in parallel (without VirtualBox), but the amount of RAM in your machine could be an issue. I think you are digging yourself into a deeper hole.
Won't the database be locked, and/or potentially messed up, if you are not careful when trying to run two instances? Would you have to use the in-memory database option?
I have 32 GB on my desktop, so that's not a problem at all, but darktable locks the database for good reasons. So two instances on the same OS are out of the question. VirtualBox is good enough to use all the available CPU cores.
And yes, looking at the performance of current fast GPUs, not using the CPU for rendering and letting it do all the OS overhead, scheduling, preparation etc. isn't a bad thing at all. That way dt never runs into an I/O limit.
With VirtualBox, you’ll have 2 copies of the DB, right (one in each VM, or one in a VM, one on the host)? There is nothing to prevent you from making 2 copies directly on your machine.
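For example (an entirely untested sketch, with placeholder paths), two darktable-cli runs with separate config directories and libraries should not fight over the same database:

    # each instance gets its own config dir and library, so there is no lock contention
    darktable-cli ~/photos/batch-a/ ~/export-a/ --core --configdir ~/.config/darktable-a --library ~/.config/darktable-a/library.db &
    darktable-cli ~/photos/batch-b/ ~/export-b/ --core --configdir ~/.config/darktable-b --library ~/.config/darktable-b/library.db &
    wait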
I still think your approach won’t help, though. Getting a 100-dollar used 1060 with 6 GB would be so much simpler and is known to work.
Maybe my approach to this test wasn't made clear enough: it was a test with a cheap card, and the actual performance does not matter most of the time. By starting at the low end, any and all issues can be seen quite clearly, and all the things I learned are a big bonus in understanding where I should put my money, or where I don't have to. After all, I have been working with dt for three years without a GPU and it worked, and still works, fine.
If I ever decide that I need a serious GPU I am looking at a 3060 or 4060 and that was quite clear before I got the 1030.
My takeaway here is: either put the money into a great CPU and forget the GPU, or get a cheap, average CPU (which is still a blast for less demanding tasks) and invest in a solid GPU – if darktable performance is the main bottleneck and deciding factor.
32 GB of RAM for the main CPU but only a GT1030 is… a rather unbalanced configuration.
The problem with trying to mix CPU and GPU utilization is similar to why the GT1030 isn't performing well here - in terms of raw performance the GT1030 is likely significantly more powerful than the CPU, but NOT powerful enough to offset the penalty of sending data there and back.
This same penalty will actually get WORSE if you try to have the CPU take over some of the tasks, because you're now doing less work on the GPU while still incurring (and possibly increasing) the send/receive penalties.
As an example, WAY back in the day (2018 or so), on an i5-7200U (which had unified memory making the send/receive penalties less, but still not nonexistent), OpenCL offload to the discrete GPU would show as slower for darktable’s benchmark workload, and many “lighter” workloads, but if you tried to use the synthetic exposure fusion function ( compressing dynamic range with exposure fusion | darktable ), that was compute-intensive enough that OpenCL acceleration would win hands-down.
My Ryzen 5 5600 has a score of 21˙597 at PassMark Intel vs AMD CPU Benchmarks - High End. On the same page, the fastest CPU, with a price of $9˙990, has a performance score of 151˙398, or about 7 times higher. An AMD EPYC 7702 scores 69˙872 (about 3 times higher) and costs $1˙013.00; the first sub-$1000 CPU is the AMD Ryzen 9 7950X, with a pretty similar performance score (63˙223) and a cost of $540.99.
In my latest test, quoted already above, the 1060 / 6GB was about 2.5 times as fast as the CPU. An NVidia 3060 costs about $300 where I live. The 1060 has a score around 4˙000 at PassMark Software - Video Card Benchmarks - GPU Compute Video Cards (not directly comparable to the CPU scores, of course). A 3060 has a score of over 8˙000, a 3060 Ti over 10˙000, a 4060 over 9˙000, a 4060 Ti over 12˙000.
Very unscientifically, if my 1060 made my exports about 2.5 times faster than using my Ryzen, then 1 GPU ‘performance point’ is worth about 2.5 * 21˙597 / 4˙000 = 13.5 CPU ‘performance points’.
That would mean the 3060 would be equivalent to a CPU scoring 108˙000, the 3060 Ti about 135˙000, the 4060 121˙500, and the 4060 Ti about 162˙000. My 1060 would be the equal of a CPU scoring 54˙000 points, placing it in the same league as an Intel Core i7-14700 (~$420) or an AMD Ryzen 9 7900X (~$450).
(Prices based on a local web store, with the very rough conversion ratio 1 CHF = 1 USD.)
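To make the back-of-the-envelope arithmetic above easy to check (all numbers exactly as quoted, purely illustrative):

    # 1 GPU 'point' is worth roughly speedup * CPU score / GPU score CPU 'points'
    awk 'BEGIN {
      ratio = 2.5 * 21597 / 4000                      # about 13.5
      printf "1060    -> about %6.0f CPU points\n",  4000 * ratio
      printf "3060    -> about %6.0f CPU points\n",  8000 * ratio
      printf "3060 Ti -> about %6.0f CPU points\n", 10000 * ratio
      printf "4060    -> about %6.0f CPU points\n",  9000 * ratio
      printf "4060 Ti -> about %6.0f CPU points\n", 12000 * ratio
    }'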