OpenCL support in dt 4.2 does not survive a suspend/resume cycle on my system; reproducible at will

LateJunction · January 21, 2023, 6:02pm

On my system (4th gen Intel i7, kernel 5.15.0-58, Mint 21.3, 2 GB Zotac Nvidia 1050, 16 GB RAM, darktable 4.2.0) I have a consistently reproducible problem with OpenCL support. Advice would be much appreciated, although I suspect this will most probably turn out to be a user error.

If I suspend my PC with dt running and preferences showing that OpenCL support is available, then, after resume, OpenCL support is still available. But if I then close dt and restart it, having made absolutely no changes to my Mint install (that I am aware of) and irrespective of any editing I did in dt before closing it, OpenCL support is no longer available on the next start of dt. It continues to be ‘not available’ no matter how many times I restart dt, or Cinnamon. The only way I have found to re-enable OpenCL support is to restart the PC.

In practical terms this means that OpenCL support on my system will not survive a suspend/resume followed by a close/restart of dt. This is a scenario I use frequently. Admittedly there is an obvious workaround: close dt before suspending the PC. But this is not entirely efficient, is it? Also, as far as I know, this requirement is not explicitly stated in ‘the book’.

The primary purpose of this posting is to ask if there is any advice as to what might be causing this behaviour, or how I might modify something in Mint, or in dt, so that OpenCL support survives the suspend/resume cycle. However, I see a somewhat broader issue: the ‘robustness’ of the support for OpenCL in dt, based on how frequently I find that support is not available.

The manual entry on OpenCL is, as is typical of the whole, comprehensive and well written, and a significant improvement on what it was just a few versions ago. It is, however, difficult for people with my level of technical expertise to follow the requirements stated in the manual, and especially to work through the implications of: “If anything does not fit … OpenCL support will likely not be available”. So, for example, the output of ‘darktable -d opencl’ after a PC restart is:

[dt_get_sysresource_level] switched to 1 as `default’
total mem: 15964MB
mipmap cache: 1995MB
available mem: 7982MB
singlebuff: 124MB
OpenCL tune mem: WANTED
OpenCL pinned: OFF
[opencl_init] opencl related configuration options:
[opencl_init] opencl: ON
[opencl_init] opencl_scheduling_profile: ‘default’
[opencl_init] opencl_library: ‘default path’
[opencl_init] opencl_device_priority: ‘/!0,///!0,*’
[opencl_init] opencl_mandatory_timeout: 400
[opencl_init] opencl library ‘libOpenCL.so.1’ found on your system and loaded
[opencl_init] found 1 platform
[opencl_init] found 1 device

[dt_opencl_device_init]
DEVICE: 0: ‘NVIDIA GeForce GTX 1050’
CANONICAL NAME: nvidiageforcegtx1050
PLATFORM NAME & VENDOR: NVIDIA CUDA, NVIDIA Corporation
DRIVER VERSION: 525.78.01
DEVICE VERSION: OpenCL 3.0 CUDA, SM_20 SUPPORT
DEVICE_TYPE: GPU
GLOBAL MEM SIZE: 1996 MB
MAX MEM ALLOC: 499 MB
MAX IMAGE SIZE: 16384 x 32768
MAX WORK GROUP SIZE: 1024
MAX WORK ITEM DIMENSIONS: 3
MAX WORK ITEM SIZES: [ 1024 1024 64 ]
ASYNC PIXELPIPE: NO
PINNED MEMORY TRANSFER: NO
MEMORY TUNING: WANTED
FORCED HEADROOM: 400
AVOID ATOMICS: NO
MICRO NAP: 250
ROUNDUP WIDTH: 16
ROUNDUP HEIGHT: 16
CHECK EVENT HANDLES: 128
PERFORMANCE: 5.768
TILING ADVANTAGE: 0.000
DEFAULT DEVICE: NO
KERNEL DIRECTORY: /usr/share/darktable/kernels
CL COMPILER OPTION: -cl-fast-relaxed-math
KERNEL LOADING TIME: 0.4823 sec
[opencl_init] OpenCL successfully initialized.
[opencl_init] here are the internal numbers and names of OpenCL devices available to darktable:
[opencl_init] 0 ‘NVIDIA GeForce GTX 1050’
[opencl_init] FINALLY: opencl is AVAILABLE on this system.
[opencl_init] initial status of opencl enabled flag is ON.
[dt_opencl_update_priorities] these are your device priorities:
[dt_opencl_update_priorities] image preview export thumbs preview2
[dt_opencl_update_priorities] 0 -1 0 0 -1
[dt_opencl_update_priorities] show if opencl use is mandatory for a given pixelpipe:
[dt_opencl_update_priorities] image preview export thumbs preview2
[dt_opencl_update_priorities] 0 0 0 0 0
[opencl_synchronization_timeout] synchronization timeout set to 200
[dt_opencl_update_priorities] these are your device priorities:
[dt_opencl_update_priorities] image preview export thumbs preview2
[dt_opencl_update_priorities] 0 -1 0 0 -1
[dt_opencl_update_priorities] show if opencl use is mandatory for a given pixelpipe:
[dt_opencl_update_priorities] image preview export thumbs preview2
[dt_opencl_update_priorities] 0 0 0 0 0
[opencl_synchronization_timeout] synchronization timeout set to 200

In contrast, after a suspend/resume cycle the output from ‘darktable -d opencl’ is:

[dt_get_sysresource_level] switched to 1 as `default’
total mem: 15964MB
mipmap cache: 1995MB
available mem: 7982MB
singlebuff: 124MB
OpenCL tune mem: WANTED
OpenCL pinned: OFF
[opencl_init] opencl related configuration options:
[opencl_init] opencl: ON
[opencl_init] opencl_scheduling_profile: ‘default’
[opencl_init] opencl_library: ‘default path’
[opencl_init] opencl_device_priority: ‘/!0,///!0,*’
[opencl_init] opencl_mandatory_timeout: 400
[opencl_init] opencl library ‘libOpenCL.so.1’ found on your system and loaded
[opencl_init] could not get platforms: Unknown OpenCL error
[opencl_init] FINALLY: opencl is NOT AVAILABLE on this system.
[opencl_init] initial status of opencl enabled flag is OFF.
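
The two logs differ from the ‘could not get platforms’ line onwards, and the verdict is on the line starting with ‘FINALLY’. As a minimal sketch, assuming the log has been saved to a file, a small helper can pull out just that verdict. The function name opencl_verdict and the log file name are hypothetical, not part of darktable:

```shell
#!/bin/sh
# Hypothetical helper: read `darktable -d opencl` output on stdin and
# print a one-word verdict based on the "FINALLY" line shown in the logs.
opencl_verdict() {
    awk '
        /FINALLY: opencl is NOT AVAILABLE/ { print "unavailable"; found = 1; exit }
        /FINALLY: opencl is AVAILABLE/     { print "available";   found = 1; exit }
        END { if (!found) print "unknown" }
    '
}
```

Possible usage: run ‘darktable -d opencl > dt-opencl.log 2>&1’, close dt, then run ‘opencl_verdict < dt-opencl.log’.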

In the scenario described above, note that OpenCL support changes from ‘available’ to ‘not available’ when, as far as I understand it, I have made no change to my system; i.e. it’s not something I did. The (possibly erroneous) conclusion I draw is that either the executable code implementing OpenCL support needs to be more robust, or there is some requirement that I have missed in the manual.

kofa · January 21, 2023, 6:13pm

It’s probably an issue with the drivers.
You can try unloading and reloading modules.

Search the forum for
opencl rmmod

One example:

LateJunction · January 21, 2023, 7:03pm

Thanks for this pointer - I should have found that for myself, so apologies for being lazy.

This technique works to recover OpenCL after a suspend/resume/close/restart cycle, but it does not prevent entering the ‘unavailable’ condition in the first place. So this is not the full solution that I would prefer, especially as I will forget the details of these two commands within the hour!

At the risk of appearing negative, this still leaves me with the impression that there is an inherent weakness here in OpenCL support. Could these two commands not be implemented in the code, for example? Or is that far too simplistic because it does not cater for the wide range of Linux distros?

vbs · January 21, 2023, 7:23pm

I experienced it too. Since then I simply quit DT before suspending (this was the easiest for me).

g-man · January 21, 2023, 7:26pm

I think you need to look into your Mint logs for suspend issues with your Nvidia card. Look in the Mint forums for issues/resolutions. For Fedora, there is a how-to that discusses the suspend issues: Howto/NVIDIA - RPM Fusion.

One thing to try is to turn off the OpenCL memory tuning in darktable. I don’t think it will make a difference, but try it.

K-1 · January 22, 2023, 1:22am

Do other opencl applications behave the same? clinfo for example?

There is very little DT can do, if the opencl driver doesn’t survive suspend.
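
A minimal sketch of that check, assuming the clinfo package is installed: clinfo prints a ‘Number of platforms’ line near the top of its output, and a count of 0 after resume would point at the driver rather than at DT. The function names platform_count and check_opencl are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: extract the platform count from clinfo-style
# output read on stdin. Prints nothing if the line is missing.
platform_count() {
    awk '/^Number of platforms/ { print $NF; exit }'
}

# Hypothetical helper: report whether any OpenCL platform is visible
# outside darktable. Returns 0 if at least one platform is found.
check_opencl() {
    if ! command -v clinfo >/dev/null 2>&1; then
        echo "clinfo is not installed"
        return 2
    fi
    count=$(clinfo 2>/dev/null | platform_count)
    if [ "${count:-0}" -gt 0 ]; then
        echo "OpenCL platforms visible: $count"
    else
        echo "no OpenCL platforms visible"
        return 1
    fi
}
```

Run check_opencl once after a fresh boot and once after a suspend/resume cycle and compare the two results.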

mbs · January 22, 2023, 1:39am

As others have suggested, this is most likely a problem with the proprietary Nvidia drivers for Linux, which are quite buggy. I had similar issues when I tried an Nvidia card a few weeks ago, on Arch. On exactly the same system, using Intel’s built-in GPU works like a charm (if a bit slower).

Ravn_Revheim · January 22, 2023, 7:49am

I had this problem some time ago, but found an easy solution.

Upon wakeup, try running these commands (as root) to reload the nvidia module that doesn’t come back up properly:

modprobe -r nvidia_uvm
modprobe nvidia_uvm

Just make sure that darktable isn’t running first, or this will fail.

My solution was this super simple script so that I have a quick fix on wakeup:

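#!/usr/bin/pkexec /bin/bash

# Force-kill darktable first; the module reload fails while it is running
pkill -9 darktable > /dev/null 2>&1
sleep 2
# Unload and reload the nvidia_uvm module
modprobe -r nvidia_uvm
modprobe nvidia_uvm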

Hope this was helpful.

kofa · January 22, 2023, 7:56am

kill -9 gives darktable no chance to shut down, and can lead to data loss.

Ravn_Revheim · January 22, 2023, 8:00am

I know, but as darktable writes continuously to the db I have never experienced any problems with data loss.
And I have a long-standing problem of darktable processes that keep hanging around after a ‘normal kill’ or shutdown. So -9 was my crude solution.
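
A possible middle ground between the two positions, as a rough sketch only: send a normal termination signal first, wait a few seconds, and fall back to -9 only if the process is still there. The helper names stop_gracefully and reload_uvm and the 10 second default are hypothetical, and whether darktable always shuts down cleanly on a normal termination signal is an assumption:

```shell
#!/bin/sh
# Hypothetical helper: stop_gracefully NAME [TIMEOUT_SECONDS]
# Sends SIGTERM, waits up to TIMEOUT_SECONDS, then falls back to SIGKILL.
# Returns 0 if the process exited on its own, 1 if it had to be force-killed.
stop_gracefully() {
    name=$1
    timeout=${2:-10}
    waited=0
    pgrep -x "$name" >/dev/null || return 0    # nothing to stop
    pkill -TERM -x "$name"                      # polite request first
    while pgrep -x "$name" >/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            pkill -KILL -x "$name"              # last resort, same as kill -9
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Reload the module named in this thread; needs root.
reload_uvm() {
    modprobe -r nvidia_uvm && modprobe nvidia_uvm
}

# Usage (as root):
#   stop_gracefully darktable 10
#   reload_uvm
```

The return value of stop_gracefully shows whether the fallback was needed, which may help to judge how often the hanging-process problem actually occurs.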

LateJunction · January 22, 2023, 5:07pm

Thanks for this question - I had no idea that this binary existed. The simple answer is ‘No’; clinfo reports “Number of platforms 0” after a suspend/resume cycle. After a PC restart (or after removing/reloading nvidia_uvm) the output from clinfo is very comprehensive.

rvietor · January 23, 2023, 7:25am

Just keep in mind that while the drivers and GPU are from NVidia, the actual cards can be from several brands.
So that gives three levels for bugs:

NVidia drivers
card firmware
card hardware

kofa · January 23, 2023, 7:35am

You can save the commands in a file (a shell script), and make it executable. As the root user (after a sudo -i or su - command, depending on your distribution), you can put the shell script in /usr/local/bin, so you can invoke it any time.

And no, the commands cannot be put into darktable:

you may need similar commands for other hardware;
the name of modules may change;
those commands need root (administrator) access, while darktable is running with your normal user permissions.
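
The steps above could look roughly like this. It is only a sketch: the file name reload-nvidia-uvm and the function name write_reload_script are made up for illustration, and the module name is the one used earlier in this thread, which may differ on other hardware:

```shell
#!/bin/sh
# Hypothetical helper: write_reload_script DEST
# Creates the small reload script at DEST and makes it executable.
write_reload_script() {
    dest=$1
    cat > "$dest" <<'EOF'
#!/bin/sh
# Reload the nvidia_uvm module after resume.
# Run as root, with darktable closed.
set -e
modprobe -r nvidia_uvm
modprobe nvidia_uvm
EOF
    chmod 755 "$dest"
}

# Usage:
#   write_reload_script ./reload-nvidia-uvm
#   sudo cp ./reload-nvidia-uvm /usr/local/bin/
#   sudo /usr/local/bin/reload-nvidia-uvm
```

The script itself still has to be run with root permissions each time, which is exactly the part darktable cannot do for you.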

LateJunction · January 24, 2023, 10:05am

Oh, nothing really important, then …

And thanks for the guidance on the shell script; most useful.