Possibly a new deconvolution module for Darktable


(Aurélien Pierre) #61

Some updates here. This horse picture above took me 1h to run 100 iterations on a 9 px blur.

I have found an algorithmic way to accelerate the convergence of the algorithm, and I have done some more Cython optimizations. Now I run 705 iterations in 22 min with a 7 px blur for the same result.

Plus I have changed my algorithm so that the blur is computed in a separate step from the picture, meaning that the PSF can be stored and saved for later with 75% of the job already done.

Work continues on the Darktable version, and these figures give me new hope.

(Jeff Welty) #62

Finally got back to look at the code. It looks like a lot of easy speedups are possible (approx. 10% improvement on the OpenMP nested for loops): add a “collapse(2)” clause to the OpenMP #pragma lines to enable parallel processing of the second nested for loop as well. The DT gurus may have a reason not to do that; I’m just looking at it from a pure loop-performance angle.

Change this:
#pragma omp parallel for default(none) schedule(static)

to this:
#pragma omp parallel for default(none) schedule(static) collapse(2)

where you have nested “for” loops.

Also, I’m not familiar with the details yet, but a lot of loops operate on 4 channels when image->c == 4. I wonder whether the algorithm needs that, or should the max channel count be something like MIN(3, image->c)?

Great work so far

(Aurélien Pierre) #63

Thanks for your insight! For the record, I’m not the author of the C implementation, so I’m not familiar with all the low-level details.

The algo doesn’t need the 4 channels; however, it seems that in order to take advantage of the SSE instructions, you need to pass vectors of 4 floats (from what I understood, this makes better use of the CPU L1 cache for systematic operations).

I’m just at the beginning of my journey with C. For now, the “higher level” reference implementation is the Cython code, which shows the algorithmic structure of what I do more clearly, with a mixed C/Python scheme and all the maths references and papers: https://github.com/aurelienpierre/Image-Cases-Studies/blob/master/lib/deconvolution.pyx and the API: https://github.com/aurelienpierre/Image-Cases-Studies/blob/master/richardson_lucy_deconvolution.py

(Jeff Welty) #64

You are doing well given no previous C experience; you are clearly understanding a lot.

Here’s a thought about coding specifically for the SSE instructions.
Since a lot of those loops are simple operations on array elements, here’s what I would do:

Make another image structure member, c_used; then, when creating the image data, add

im->c_used = MIN(3, im->c);

In the loops, switch the inner and outer loops so that the “for (c = …)” becomes the outer loop. The compiler should automagically do the SSE stuff for you on the inner loop, but I can’t put my finger on a speed comparison of hand-coded vs. compiler-generated SSE instructions.

for (c = 0; c < im->c_used; c++)

  for (i = 0; i < im->size; i++)
      /* operation on the element of channel c at pixel i */


If it turns out that the SSE instructions need to be hand-coded, you are set up for the next step of manually unrolling the inner loop (with another check to ensure groups of 4 are unrolled without exceeding im->size).

(Aurélien Pierre) #65

Hi @houz @hanatos @LebedevRI @Pascal_Obry,

I’m still working on this. I now have 3 flavours of the algorithm, from fast and nasty to slow and clean, and I wonder: would it be a pain in the neck to cache the output of a module in a temp file on disk? Namely, the deconvolution by machine learning is performed in 2 steps:
1. estimation of the blur (and a sharp image) over a sample of the picture (255×255 px or so); the parameters are then stored (a square array of floats, between 3×3 and 33×33). This is the most CPU-consuming part (4 FFTs + a discrete gradient per iteration).
2. deblurring (regularized deconvolution) of either the resized preview (in the darkroom) or the full picture (then, possibly disk-cached). This is just 2 FFTs + a discrete gradient, but still, it doesn’t need to be recomputed every time the darkroom preview is scrolled or zoomed.

What do you think ?

(Pascal Obry) #66

I’m no expert on this part, but how is this supposed to work in the pixelpipe? I mean, if some iop before the deconvolution is changed, then the cache is invalidated and the deconvolution recomputed, right? How “fast” and “slow” are the implementations you’re talking about? How much data would have to be cached? If some hundreds of megabytes, maybe cache it in memory? And activate the module only if the computer has more than a given amount of memory?

(Aurélien Pierre) #67

How fast? Well, it depends on:

  1. the size of the blur to remove
  2. the size of the picture
  3. the parameters used, because the program now stops automatically once convergence is reached (that’s new) to avoid looping for nothing; the convergence speed depends on the settings

For now, let’s say 2 to 15 minutes for the whole process. I plan to run the initial blur estimation only upon user request (and store the result in the database afterwards). The data to be cached would be the RGB values of the whole image as 32-bit floats, so by today’s standards, 288 MB (24 Mpix) to 432 MB (36 Mpix).

But that might be unnecessary: since the blur estimation and the deblurring are now separated, it is possible to scale both the picture and the blur kernel and run the IOP only on the screen-sized preview. And for that, with Python, I’m able to run in under 1 min at HD size.


If it cleans up on exit, I personally will let it eat as much disk space as needed to save time.