Weird. With darktable 3.3.0+1314~gcdaaee146 (git master), the export is much faster on my 12-year-old machine (Core 2 Duo, 4 GB of RAM, NVIDIA GTX 1060 with 6 GB of memory), using your NEF and XMP:
35.891082 [dev] took 0.000 secs (0.000 CPU) to load the image.
36.154169 [export] creating pixelpipe took 0.236 secs (0.318 CPU)
36.154255 [pixelpipe_process] [export] using device 0
36.155219 [dev_pixelpipe] took 0.000 secs (0.000 CPU) initing base buffer [export]
36.184672 [dev_pixelpipe] took 0.029 secs (0.048 CPU) processed `raw black/white point' on GPU, blended on GPU [export]
36.188469 [dev_pixelpipe] took 0.004 secs (0.004 CPU) processed `white balance' on GPU, blended on GPU [export]
36.194689 [dev_pixelpipe] took 0.006 secs (0.010 CPU) processed `highlight reconstruction' on GPU, blended on GPU [export]
41.377531 [dev_pixelpipe] took 5.183 secs (8.119 CPU) processed `demosaic' on CPU with tiling, blended on CPU [export]
41.514513 [dev_pixelpipe] took 0.137 secs (0.145 CPU) processed `lens correction' on GPU, blended on GPU [export]
41.517665 [dev_pixelpipe] took 0.003 secs (0.001 CPU) processed `exposure' on GPU, blended on GPU [export]
41.817558 [dev_pixelpipe] took 0.300 secs (0.421 CPU) processed `tone equalizer' on CPU, blended on CPU [export]
41.837743 [dev_pixelpipe] took 0.020 secs (0.018 CPU) processed `input color profile' on GPU, blended on GPU [export]
41.882531 [dev_pixelpipe] took 0.045 secs (0.022 CPU) processed `denoise (non-local means)' on GPU, blended on GPU [export]
41.908088 [dev_pixelpipe] took 0.026 secs (0.000 CPU) processed `contrast equalizer' on GPU, blended on GPU [export]
41.936057 [dev_pixelpipe] took 0.028 secs (0.012 CPU) processed `local contrast' on GPU, blended on GPU [export]
41.942914 [dev_pixelpipe] took 0.007 secs (0.003 CPU) processed `output color profile' on GPU, blended on GPU [export]
42.020912 [dev_pixelpipe] took 0.078 secs (0.086 CPU) processed `dithering' on CPU, blended on CPU [export]
42.059684 [dev_pixelpipe] took 0.039 secs (0.061 CPU) processed `display encoding' on CPU, blended on CPU [export]
42.059903 [opencl_profiling] profiling device 0 ('GeForce GTX 1060 6GB'):
42.059915 [opencl_profiling] spent 0.0519 seconds in [Write Image (from host to device)]
42.059989 [opencl_profiling] spent 0.0016 seconds in rawprepare_1f
42.060059 [opencl_profiling] spent 0.0015 seconds in whitebalance_1f
42.060126 [opencl_profiling] spent 0.0027 seconds in highlights_1f_lch_bayer
42.060204 [opencl_profiling] spent 0.1669 seconds in [Read Image (from device to host)]
42.060272 [opencl_profiling] spent 0.0330 seconds in [Write Buffer (from host to device)]
42.060339 [opencl_profiling] spent 0.0006 seconds in lens_vignette
42.060407 [opencl_profiling] spent 0.0028 seconds in lens_distort_lanczos3
42.060473 [opencl_profiling] spent 0.0004 seconds in exposure
42.060539 [opencl_profiling] spent 0.0005 seconds in colorin_unbound
42.060618 [opencl_profiling] spent 0.0002 seconds in nlmeans_init
42.060683 [opencl_profiling] spent 0.0039 seconds in nlmeans_dist
42.060748 [opencl_profiling] spent 0.0026 seconds in nlmeans_horiz
42.060812 [opencl_profiling] spent 0.0057 seconds in nlmeans_vert
42.060877 [opencl_profiling] spent 0.0089 seconds in nlmeans_accu
42.060942 [opencl_profiling] spent 0.0006 seconds in nlmeans_finish
42.061014 [opencl_profiling] spent 0.0004 seconds in [Copy Image (on device)]
42.061079 [opencl_profiling] spent 0.0109 seconds in eaw_decompose
42.061144 [opencl_profiling] spent 0.0038 seconds in eaw_synthesize
42.061210 [opencl_profiling] spent 0.0004 seconds in pad_input
42.061276 [opencl_profiling] spent 0.0039 seconds in gauss_reduce
42.061339 [opencl_profiling] spent 0.0029 seconds in process_curve
42.061403 [opencl_profiling] spent 0.0043 seconds in laplacian_assemble
42.061467 [opencl_profiling] spent 0.0004 seconds in write_back
42.061531 [opencl_profiling] spent 0.0010 seconds in colorout
42.061594 [opencl_profiling] spent 0.3119 seconds totally in command queue (with 0 events missing)
42.061730 [dev_process_export] pixel pipeline processing took 5.907 secs (8.953 CPU)
The main differences:
You: 39,631271 [dev_pixelpipe] took 21,260 secs (74,266 CPU) processed `tone equalizer' on CPU, blended on CPU [export]
Me: 41.817558 [dev_pixelpipe] took 0.300 secs (0.421 CPU) processed `tone equalizer' on CPU, blended on CPU [export]
You: 58,275698 [dev_pixelpipe] took 18,082 secs (2,172 CPU) processed `denoise (non-local means)' on GPU, blended on GPU [export]
Me: 41.882531 [dev_pixelpipe] took 0.045 secs (0.022 CPU) processed `denoise (non-local means)' on GPU, blended on GPU [export]
You: 97,638092 [dev_pixelpipe] took 39,362 secs (138,781 CPU) processed `contrast equalizer' on CPU with tiling, blended on CPU [export]
Me: 41.908088 [dev_pixelpipe] took 0.026 secs (0.000 CPU) processed `contrast equalizer' on GPU, blended on GPU [export]
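If you want to compare two full logs rather than picking lines by eye, here is a rough Python sketch. It is not part of darktable; the names `parse_timings` and `slowdowns` are made up for this post, and it assumes the `[dev_pixelpipe] took ... processed ...` line format shown above. It accepts both decimal points and decimal commas, since your log uses commas.

```python
import re

# Matches lines such as:
#   41.817558 [dev_pixelpipe] took 0.300 secs (0.421 CPU) processed `tone equalizer' on CPU, blended on CPU [export]
# The quote classes also accept typographic quotes, in case the log was pasted through a forum editor.
LINE_RE = re.compile(
    r"\[dev_pixelpipe\] took (?P<wall>[\d.,]+) secs \((?P<cpu>[\d.,]+) CPU\) "
    r"processed [`'\u2018\u2019](?P<module>.+?)[`'\u2018\u2019] on (?P<device>GPU|CPU)"
    r"(?P<tiling> with tiling)?"
)


def to_float(text):
    """Convert '21,260' or '21.260' to a float (decimal comma or point)."""
    return float(text.replace(",", "."))


def parse_timings(log_text):
    """Return {module: (wall_seconds, device, tiled)} for each processed module."""
    timings = {}
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            timings[match.group("module")] = (
                to_float(match.group("wall")),
                match.group("device"),
                match.group("tiling") is not None,
            )
    return timings


def slowdowns(yours, mine):
    """List (module, yours/mine wall-time ratio), largest ratio first.

    Modules missing from either log, or with a zero time in `mine`, are skipped.
    """
    ratios = []
    for module, (your_wall, _, _) in yours.items():
        if module in mine and mine[module][0] > 0:
            ratios.append((module, your_wall / mine[module][0]))
    return sorted(ratios, key=lambda item: item[1], reverse=True)
```

Feed each log to `parse_timings` as one string, then call `slowdowns(yours, mine)`. The modules at the top of the list are the ones to look at first, and the device and tiling fields show whether a module fell back to the CPU or had to tile.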