
It’s weird because I have even less than that and it works perfectly:

0.811893 [opencl_init] device 0 `Quadro M2200' has sm_20 support.
0.812055 [opencl_init] device 0 `Quadro M2200' supports image sizes of 16384 x 16384
0.812060 [opencl_init] device 0 `Quadro M2200' allows GPU memory allocations of up to 1010MB
[opencl_init] device 0: Quadro M2200 
     CANONICAL_NAME:           quadrom
     GLOBAL_MEM_SIZE:          4044MB
     MAX_WORK_GROUP_SIZE:      1024
     MAX_WORK_ITEM_DIMENSIONS: 3
     MAX_WORK_ITEM_SIZES:      [ 1024 1024 64 ]
     DRIVER_VERSION:           470.74
     DEVICE_VERSION:           OpenCL 3.0 CUDA
0.856024 [opencl_init] options for OpenCL compiler: -w -cl-fast-relaxed-math  -DNVIDIA_SM_20=1 -DNVIDIA=1 -I"/opt/darktable/share/darktable/kernels"
63,301325 [dev] took 0,000 secs (0,000 CPU) to load the image.
63,413158 [export] creating pixelpipe took 0,099 secs (0,250 CPU)
63,413187 [pixelpipe_process] [export] using device 0
63,413224 [dev_pixelpipe] took 0,000 secs (0,000 CPU) initing base buffer [export]
63,427966 [dev_pixelpipe] took 0,015 secs (0,013 CPU) processed `point noir/blanc raw' on GPU, blended on GPU [export]
63,439290 [dev_pixelpipe] took 0,011 secs (0,006 CPU) processed `balance des blancs' on GPU, blended on GPU [export]
63,657938 [dev_pixelpipe] took 0,219 secs (0,115 CPU) processed `dématriçage' on GPU, blended on GPU [export]
63,689078 [dev_pixelpipe] took 0,031 secs (0,018 CPU) processed `correction des objectifs' on GPU, blended on GPU [export]
63,728183 [dev_pixelpipe] took 0,039 secs (0,026 CPU) processed `exposition' on GPU, blended on GPU [export]
64,456505 [dev_pixelpipe] took 0,728 secs (2,702 CPU) processed `égaliseur de ton' on CPU, blended on CPU [export]
64,621579 [dev_pixelpipe] took 0,165 secs (1,181 CPU) processed `égaliseur de ton 1' on CPU, blended on CPU [export]
64,713158 [dev_pixelpipe] took 0,092 secs (0,088 CPU) processed `profil de couleur d'entrée' on GPU, blended on GPU [export]
image colorspace transform Lab-->RGB took 0,029 secs (0,018 GPU) [channelmixerrgb ]
64,819916 [dev_pixelpipe] took 0,107 secs (0,072 CPU) processed `calibration des couleurs' on GPU, blended on GPU [export]
64,884695 [default_process_tiling_cl_ptp] use tiling on module 'diffuse' for image with full size 7374 x 4924
64,884706 [default_process_tiling_cl_ptp] (5 x 3) tiles with max dimensions 3832 x 3833 and overlap 1024
64,884714 [default_process_tiling_cl_ptp] tile (0, 0) with 3832 x 3833 at origin [0, 0]
83,859892 [default_process_tiling_cl_ptp] tile (0, 1) with 3832 x 3139 at origin [0, 1785]
99,507791 [default_process_tiling_cl_ptp] tile (1, 0) with 3832 x 3833 at origin [1784, 0]
118,548666 [default_process_tiling_cl_ptp] tile (1, 1) with 3832 x 3139 at origin [1784, 1785]
134,191398 [default_process_tiling_cl_ptp] tile (2, 0) with 3806 x 3833 at origin [3568, 0]
153,464296 [default_process_tiling_cl_ptp] tile (2, 1) with 3806 x 3139 at origin [3568, 1785]
169,288535 [dev_pixelpipe] took 104,469 secs (104,629 CPU) processed `diffusion ou netteté' on GPU with tiling, blended on CPU [export]
169,288744 [default_process_tiling_cl_ptp] use tiling on module 'diffuse' for image with full size 7374 x 4924
169,288766 [default_process_tiling_cl_ptp] (2 x 1) tiles with max dimensions 4728 x 4924 and overlap 16
169,288768 [default_process_tiling_cl_ptp] tile (0, 0) with 4728 x 4924 at origin [0, 0]
180,851588 [default_process_tiling_cl_ptp] tile (1, 0) with 2678 x 4924 at origin [4696, 0]
185,238803 [dev_pixelpipe] took 15,950 secs (15,894 CPU) processed `diffusion ou netteté 1' on GPU with tiling, blended on CPU [export]
185,238982 [default_process_tiling_cl_ptp] use tiling on module 'diffuse' for image with full size 7374 x 4924
185,238986 [default_process_tiling_cl_ptp] (2 x 1) tiles with max dimensions 3956 x 4924 and overlap 64
185,238989 [default_process_tiling_cl_ptp] tile (0, 0) with 3956 x 4924 at origin [0, 0]
212,373233 [default_process_tiling_cl_ptp] tile (1, 0) with 3546 x 4924 at origin [3828, 0]
235,852099 [dev_pixelpipe] took 50,613 secs (48,805 CPU) processed `diffusion ou netteté 2' on GPU with tiling, blended on CPU [export]
236,136452 [dev_pixelpipe] took 0,284 secs (0,274 CPU) processed `balance couleur rvb' on GPU, blended on GPU [export]
236,206395 [default_process_tiling_cl_ptp] use tiling on module 'filmicrgb' for image with full size 7374 x 4924
236,206407 [default_process_tiling_cl_ptp] (2 x 1) tiles with max dimensions 5388 x 4924 and overlap 512
236,206409 [default_process_tiling_cl_ptp] tile (0, 0) with 5388 x 4924 at origin [0, 0]
236,382849 [default_process_tiling_cl_ptp] tile (1, 0) with 3010 x 4924 at origin [4364, 0]
236,519600 [dev_pixelpipe] took 0,383 secs (0,333 CPU) processed `filmique rvb' on GPU with tiling, blended on CPU [export]
image colorspace transform RGB-->Lab took 0,018 secs (0,016 GPU) [colorout ]
236,749322 [dev_pixelpipe] took 0,230 secs (0,197 CPU) processed `profil de couleur de sortie' on GPU, blended on GPU [export]
237,256079 [dev_pixelpipe] took 0,507 secs (0,504 CPU) processed `homogénéisation' on CPU, blended on CPU [export]
237,391051 [dev_pixelpipe] took 0,135 secs (0,637 CPU) processed `encodage écran' on CPU, blended on CPU [export]
237,391441 [opencl_profiling] profiling device 0 ('Quadro M2200'):
237,391446 [opencl_profiling] spent  2,3921 seconds in [Write Image (from host to device)]
237,391449 [opencl_profiling] spent  0,0042 seconds in rawprepare_1f
237,391451 [opencl_profiling] spent  0,0045 seconds in whitebalance_1f
237,391453 [opencl_profiling] spent  0,0022 seconds in border_interpolate
237,391455 [opencl_profiling] spent  0,0085 seconds in rcd_border_green
237,391457 [opencl_profiling] spent  0,0108 seconds in rcd_border_redblue
237,391459 [opencl_profiling] spent  0,0097 seconds in rcd_populate
237,391461 [opencl_profiling] spent  0,0123 seconds in rcd_step_1_1
237,391463 [opencl_profiling] spent  0,0065 seconds in rcd_step_1_2
237,391465 [opencl_profiling] spent  0,0068 seconds in rcd_step_2_1
237,391467 [opencl_profiling] spent  0,0183 seconds in rcd_step_3_1
237,391469 [opencl_profiling] spent  0,0103 seconds in rcd_step_4_1
237,391471 [opencl_profiling] spent  0,0029 seconds in rcd_step_4_2
237,391472 [opencl_profiling] spent  0,0147 seconds in rcd_step_5_1
237,391474 [opencl_profiling] spent  0,0235 seconds in rcd_step_5_2
237,391476 [opencl_profiling] spent  0,0149 seconds in rcd_write_output
237,391478 [opencl_profiling] spent  0,0521 seconds in [Copy Image (on device)]
237,391482 [opencl_profiling] spent  0,0256 seconds in exposure
237,391485 [opencl_profiling] spent  0,7708 seconds in [Read Image (from device to host)]
237,391487 [opencl_profiling] spent  0,0208 seconds in colorin_unbound
237,391488 [opencl_profiling] spent  0,0238 seconds in colorspaces_transform_lab_to_rgb_matrix
237,391491 [opencl_profiling] spent  0,0268 seconds in channelmixerrgb_CAT16
237,391493 [opencl_profiling] spent 98,7387 seconds in diffuse_blur_bspline
237,391496 [opencl_profiling] spent 69,2394 seconds in diffuse_pde
237,391499 [opencl_profiling] spent  0,0362 seconds in colorbalancergb
237,391501 [opencl_profiling] spent  0,0114 seconds in filmic_mask_clipped_pixels
237,391504 [opencl_profiling] spent  0,0188 seconds in filmicrgb_chroma
237,391506 [opencl_profiling] spent  0,0202 seconds in colorspaces_transform_rgb_matrix_to_lab
237,391509 [opencl_profiling] spent  0,0600 seconds in colorout
237,391511 [opencl_profiling] spent 171,5868 seconds totally in command queue (with 0 events missing)
237,391536 [dev_process_export] pixel pipeline processing took 173,978 secs (175,495 CPU)

It’s slow as hell too, but this is an export, so it hardly matters (you don’t need to sit in front of the computer while it runs).
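To see where the time goes in a log like the one above, the `[dev_pixelpipe]` lines can be ranked by wall time. This is only a rough sketch: the regular expression is inferred from the lines pasted in this thread, and the function names are made up for illustration. It accepts both comma and dot decimal separators, since the log format depends on the locale.

```python
import re

# Matches lines such as:
#   63,657938 [dev_pixelpipe] took 0,219 secs (0,115 CPU) processed `dématriçage' on GPU, ...
LINE_RE = re.compile(
    r"\[dev_pixelpipe\] took ([\d.,]+) secs \(([\d.,]+) CPU\) "
    r"processed `(.+?)' on (GPU|CPU)( with tiling)?"
)


def to_seconds(text):
    """Accept both '0,219' (comma locale) and '0.219'."""
    return float(text.replace(",", "."))


def parse_pixelpipe(log_text):
    """Return (module, wall_seconds, device, tiled) tuples, slowest first."""
    rows = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            wall, _cpu, module, device, tiled = m.groups()
            rows.append((module, to_seconds(wall), device, tiled is not None))
    return sorted(rows, key=lambda r: r[1], reverse=True)


# Three lines copied from the log above.
sample = """\
63,657938 [dev_pixelpipe] took 0,219 secs (0,115 CPU) processed `dématriçage' on GPU, blended on GPU [export]
64,456505 [dev_pixelpipe] took 0,728 secs (2,702 CPU) processed `égaliseur de ton' on CPU, blended on CPU [export]
169,288535 [dev_pixelpipe] took 104,469 secs (104,629 CPU) processed `diffusion ou netteté' on GPU with tiling, blended on CPU [export]
"""

for module, secs, device, tiled in parse_pixelpipe(sample):
    print(f"{secs:9.3f} s  {device}{' (tiled)' if tiled else ''}  {module}")
```

Run on the full log, the three tiled instances of diffuse or sharpen end up at the top of the list, which matches the `diffuse_blur_bspline` and `diffuse_pde` totals in the profiling section.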

I can reproduce the problem with the same image and style on my 8 GB GTX 1080. All of the memory allocations in process_cl succeed, but then wavelets_process_cl fails with error -4 (CL_MEM_OBJECT_ALLOCATION_FAILURE). It looks like the GPU needs some free memory to run the kernel, and when the tile size is chosen to be as large as possible, not enough is left over. Adding an additional 2 to tiling->factor_cl in tiling_callback fixes the problem on my system.

There is also an option opencl_memory_headroom in darktablerc. Increasing that from the default 400 to 800 also fixes the problem, but if the amount of extra memory it needs is proportional to the image / tile size then it probably would be better to change tiling->factor_cl than to fix it that way.
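The difference between the two fixes can be illustrated with a toy model. This is not darktable's allocator: the 16 bytes per pixel, the factor of 6 and both function names are assumptions chosen only to show the shape of the trade-off. The tile is sized from a planning factor, and the memory left free is then computed from what the module is assumed to really allocate.

```python
MB = 1024 * 1024
BYTES_PER_PIXEL = 16   # assumption: 4 channels of 32-bit float per pixel


def max_tile_pixels(total_mb, headroom_mb, factor):
    """Largest tile (in pixels) if a module needs factor * tile bytes."""
    usable = (total_mb - headroom_mb) * MB
    return int(usable // (factor * BYTES_PER_PIXEL))


def free_mb(total_mb, pixels, real_factor):
    """Memory left on the card once the tile's buffers are allocated."""
    return total_mb - pixels * real_factor * BYTES_PER_PIXEL / MB


REAL_FACTOR = 6.0   # hypothetical: what the module actually allocates

for total in (4096, 8192):
    a = max_tile_pixels(total, 400, REAL_FACTOR)       # default headroom
    b = max_tile_pixels(total, 800, REAL_FACTOR)       # larger headroom
    c = max_tile_pixels(total, 400, REAL_FACTOR + 2)   # padded factor
    print(total, [round(free_mb(total, p, REAL_FACTOR)) for p in (a, b, c)])
```

In this model a larger headroom leaves the same fixed reserve on every card, while a padded factor leaves a reserve that grows with the tile size. If the extra memory the kernel needs really is proportional to the tile, the padded factor is the variant that scales with it.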

Edit: same issue came up in this thread, opencl_memory_headroom was the solution.


Just did. 90 seconds without OpenCL, and 90 seconds with OpenCL :frowning:
(because it could not run on the GPU and fell back to the CPU path…)

I will have to try with @paolod’s suggestions.

Have fun!
Claes in Lund, Sweden

Hah, it’s nice to find the solution I posted myself earlier. :slight_smile: I’ve just never considered that with 6 GB on the card I need to worry about the headroom. Anyway, with the headroom set to 800 MB, tiling succeeds and I get:

50.629860 [pixelpipe_process] [export] using device 0
...
51.322878 [default_process_tiling_cl_ptp] use tiling on module 'diffuse' for image with full size 7374 x 4924
51.322883 [default_process_tiling_cl_ptp] (4 x 1) tiles with max dimensions 4320 x 4924 and overlap 1024
51.322884 [default_process_tiling_cl_ptp] tile (0, 0) with 4320 x 4924 at origin [0, 0]
60.762710 [default_process_tiling_cl_ptp] tile (1, 0) with 4320 x 4924 at origin [2272, 0]
70.206403 [default_process_tiling_cl_ptp] tile (2, 0) with 2830 x 4924 at origin [4544, 0]
74.961507 [dev_pixelpipe] took 23.684 secs (23.518 CPU) processed `diffuse or sharpen' on GPU with tiling, blended on CPU [export]
74.961526 [default_process_tiling_cl_ptp] use tiling on module 'diffuse' for image with full size 7374 x 4924
74.961528 [default_process_tiling_cl_ptp] (2 x 1) tiles with max dimensions 6852 x 4924 and overlap 16
74.961530 [default_process_tiling_cl_ptp] tile (0, 0) with 6852 x 4924 at origin [0, 0]
81.641600 [default_process_tiling_cl_ptp] tile (1, 0) with 554 x 4924 at origin [6820, 0]
82.020020 [dev_pixelpipe] took 7.059 secs (7.005 CPU) processed `diffuse or sharpen 1' on GPU with tiling, blended on CPU [export]
82.020039 [default_process_tiling_cl_ptp] use tiling on module 'diffuse' for image with full size 7374 x 4924
82.020042 [default_process_tiling_cl_ptp] (2 x 1) tiles with max dimensions 5732 x 4924 and overlap 64
82.020044 [default_process_tiling_cl_ptp] tile (0, 0) with 5732 x 4924 at origin [0, 0]
97.672605 [default_process_tiling_cl_ptp] tile (1, 0) with 1770 x 4924 at origin [5604, 0]
100.578624 [dev_pixelpipe] took 18.559 secs (17.528 CPU) processed `diffuse or sharpen 2' on GPU with tiling, blended on CPU [export]
...
101.161714 [opencl_profiling] spent 22.5110 seconds in diffuse_blur_bspline
101.161716 [opencl_profiling] spent 25.9366 seconds in diffuse_pde
...
101.161727 [opencl_profiling] spent 49.4508 seconds totally in command queue (with 0 events missing)
101.161746 [dev_process_export] pixel pipeline processing took 50.532 secs (51.887 CPU)

@Claes : you say your machine is only ‘a little bit better’ than mine; but on the CPU path (OpenCL disabled) I got pixel pipeline processing took 275.756 secs (3153.805 CPU) (all 12 ‘hyperthreaded’ cores in use, CPU running at around 4.2 GHz), while you get 90 seconds… that’s hardly ‘a little bit’. (This is with ND800_0005626_anonymized.NEF, with just the default settings and Aurélien’s style.)

I’ve now recompiled with --build-type Release (I had not specified a build type previously, so it used the default, RelWithDebInfo), and got much better timings:

73.566043 [dev_pixelpipe] took 44.513 secs (495.844 CPU) processed `diffuse or sharpen' on CPU, blended on CPU [export]
89.567931 [dev_pixelpipe] took 16.002 secs (180.493 CPU) processed `diffuse or sharpen 1' on CPU, blended on CPU [export]
138.042138 [dev_pixelpipe] took 48.474 secs (551.746 CPU) processed `diffuse or sharpen 2' on CPU, blended on CPU [export]
...
140.268511 [dev_process_export] pixel pipeline processing took 112.221 secs (1261.094 CPU)

The headroom still matters even with a large amount of GPU memory, because of tiling. When there isn’t enough memory for the module to process the entire image at once, it breaks the image up into tiles that are processed one at a time. To minimize the number of separate tiles that have to be processed, it tries to make them as large as possible. The largest tile size is the size that fills up almost all of the available GPU memory, leaving only the specified headroom free.
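The tile layouts printed in the logs earlier in this thread can be reproduced with a few lines, which makes the link between tile size and tile count concrete. This sketch is inferred from the log output only, not taken from darktable's source: the step of `tile_size - 2 * overlap` between origins and the rule that drops a trailing tile no larger than twice the overlap are assumptions that happen to match every logged layout. The function name is made up.

```python
def tile_spans(image_size, tile_size, overlap):
    """Return (origin, length) pairs for the tiles along one axis."""
    step = tile_size - 2 * overlap   # neighbouring tiles share 2 * overlap
    if step <= 0:
        raise ValueError("tile too small for this overlap")
    spans = []
    origin = 0
    while origin < image_size:
        length = min(tile_size, image_size - origin)
        # Assumption: a trailing tile that fits entirely inside the
        # overlap region of its neighbour is not processed.
        if origin > 0 and length <= 2 * overlap:
            break
        spans.append((origin, length))
        origin += step
    return spans


# 7374 px wide image, overlap 1024, as in the first 'diffuse' instance.
print(tile_spans(7374, 3832, 1024))   # tile width from the Quadro M2200 log
print(tile_spans(7374, 4320, 1024))   # tile width from the 800 MB headroom log
print(tile_spans(4924, 3833, 1024))   # vertical axis in the Quadro M2200 log
```

On the Quadro M2200 the 3832 x 3833 limit gives three columns and two rows, so six tiles. In the later log the tile can span the full 4924 px height, so the same module instance needs only three.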
