Possibly a new deconvolution module for Darktable

deblur
focus
darktable
rawprocessing

(Aurélien Pierre) #61

Some updates here. This horse picture above took me 1h to run 100 iterations on a 9 px blur.

I have found an algorithmic way to accelerate the convergence of the algorithm + I have done some more Cython optimizations. Now I run 705 iterations in 22 min with a 7px blur for the same result.

Plus I have changed my algorithm so that the blur is computed in a separate step from the picture, meaning that the PSF can be stored and saved for later with 75% of the job already done.

The work continues on the Darktable version, and these figures give new hope.


(Jeff Welty) #62

Finally got back to look at the code. Looks like a lot of easy speedups are possible (approx. 10% improvement on the OpenMP nested for loops). Add a “collapse(2)” to the OpenMP #pragma lines to enable parallel processing on the second nested for loop. The DT gurus may have a reason not to do that; I’m just looking at it purely from the performance of the loops themselves.

Change this:
#pragma omp parallel for default(none) schedule(static)

to this:
#pragma omp parallel for default(none) schedule(static) collapse(2)

where you have nested “for” loops.
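A minimal sketch of what that change looks like on a complete loop nest. The function `scale_image` and its buffer layout are hypothetical, not taken from the darktable source. Note that `collapse(2)` only applies when the two loops are perfectly nested, and that `default(none)` forces every variable used in the region to be listed explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the darktable code): multiplies every sample
 * of a row-major width x height single-channel buffer by `gain`. */
void scale_image(float *buf, int width, int height, float gain)
{
  assert(buf != NULL && width >= 0 && height >= 0);

  /* collapse(2) fuses both loops into a single iteration space of
   * width * height iterations, which is then divided among the threads.
   * Without it, only the outer loop (height iterations) is divided.
   * No statement may sit between the two "for" lines. */
#pragma omp parallel for default(none) shared(buf) firstprivate(width, height, gain) schedule(static) collapse(2)
  for(int y = 0; y < height; y++)
    for(int x = 0; x < width; x++)
      buf[(size_t)y * width + x] *= gain;
}
```

Compiled without OpenMP support, the pragma is ignored and the function runs serially with the same result. Whether the collapsed version is actually faster has to be measured, since it depends on the image size and on the number of threads.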

Also, I’m not familiar yet with details, but a lot of loops operate on 4 channels if image->c == 4. I wonder if the algorithm needs that or … should the max channels be something like MIN(3, image->c) ?

Great work so far


(Aurélien Pierre) #63

thanks for your insight ! For the record, I’m not the author of the C implementation, so I’m not familiar with all the low-level details.

The algo doesn’t need the 4 channels, however it seems that in order to take advantage of the SSE instructions, you need to pass vectors of 4 floats (from what I understood, it makes better use of the CPU L1 cache for systematic operations).

I’m just at the beginning of my journey with C. For now, the “higher level” reference implementation is the Cython code, which better shows the algorithmic structure of what I do, with a mixed C/Python scheme and all the maths references and papers: https://github.com/aurelienpierre/Image-Cases-Studies/blob/master/lib/deconvolution.pyx and the API: https://github.com/aurelienpierre/Image-Cases-Studies/blob/master/richardson_lucy_deconvolution.py


(Jeff Welty) #64

You are doing well given no previous C experience; you clearly understand a lot.

Here’s a thought about coding specifically for the SSE instructions.
Since a lot of those loops are simple operations on array elements, here’s what I would do:

make another image structure element, c_used, then when creating the image data add

im->c_used = MIN(3,im->c) ;

In loops, switch the inner and outer loops so the “for(c = …)” becomes the outer loop. The compiler should automagically do the SSE stuff for you on the inner loop, but I can’t put my finger on a speed comparison of hand-coded SSE instructions vs compiler-generated SSE instructions.

for(c = 0 ; c < im->c_used ; c++)
{
  for(i = 0 ; i < im->size ; i++)
  {
    /* operation for element[i+c] */
  }
}

If it turns out that the SSE instructions need to be hand-coded, you are set up for the next step of manually unrolling the inner loop (with another check to ensure groups of 4 are unrolled, not exceeding im->size).
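The loop reordering described above could be sketched as follows. This assumes planar storage (each channel in its own contiguous block), so that the inner loop walks memory contiguously. The `image_t` type, its `data` field and both function names are hypothetical; only `c`, `c_used` and `size` come from the discussion. With interleaved RGBA data the index would be `i * im->c + ch` instead, and the inner loop would be strided, which compilers vectorize less readily.

```c
#include <assert.h>
#include <stddef.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical image structure for this sketch. */
typedef struct
{
  float *data;  /* planar storage: channel ch starts at data + ch * size */
  size_t size;  /* number of pixels per channel */
  int c;        /* channels allocated (4 for RGBA) */
  int c_used;   /* channels actually processed */
} image_t;

/* Set once, when the image data is created. */
void image_init_channels(image_t *im)
{
  assert(im != NULL);
  im->c_used = MIN(3, im->c);
}

/* Example of a simple per-element operation: add a constant to every
 * sample of the channels in use. */
void image_add_scalar(image_t *im, float value)
{
  assert(im != NULL && im->data != NULL);

  for(int ch = 0; ch < im->c_used; ch++)
  {
    float *plane = im->data + (size_t)ch * im->size;

    /* Contiguous inner loop with no dependency between iterations:
     * a candidate for compiler auto-vectorization. */
    for(size_t i = 0; i < im->size; i++)
      plane[i] += value;
  }
}
```

Whether the compiler really emits SSE code for the inner loop can be checked with its vectorization report (for instance `-fopt-info-vec` with GCC) rather than assumed.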


(Aurélien Pierre) #65

Hi @houz @hanatos @LebedevRI @Pascal_Obry,

I’m still working on this. I now have 3 flavours of the algorithm, from fast and nasty to slow and clean, and I wonder: would it be a pain in the neck to cache the output of a module into a temp file on the disk? Namely, the deconvolution by machine learning is performed in 2 steps:
1. estimation of the blur (and a sharp image) over a sample of the picture (255×255 px or so). Then, the parameters are stored (a square array of floats, between 3×3 and 33×33). This is the most time-consuming part (4 FFT + discrete gradient per iteration).
2. deblurring (regularized deconvolution) of either the resized preview (in the darkroom) or the full picture (then, possibly disk-cached). This is just 2 FFT + discrete gradient, but still, it doesn’t need to be recomputed every time the darkroom preview is scrolled or zoomed.

What do you think ?


(Pascal Obry) #66

I’m no expert on this part. But how is this supposed to work in the pixelpipe? I mean, if some iop before the deconvolution is changed, then the cache is invalidated and the deconvolution recomputed, right? How “fast” and “slow” are the implementations you’re talking about? How much data is to be cached? If it is some hundreds of megabytes, maybe cache it in memory? And activate the module only if the computer has more than a given amount of memory?


(Aurélien Pierre) #67

How fast ? Well, it depends on:

  1. the size of the blur to remove
  2. the size of the picture
  3. the parameters used, because the program automatically stops once convergence is reached (that’s new) to avoid looping for nothing, and the convergence speed depends on the settings

For now, let’s say 2 to 15 minutes for the whole process. I plan to run the initial blur estimation only upon user request (and store it in the database afterwards). The data to be cached would be the RGB values of the whole image as 32-bit floats, so by today’s standards, it would be 288 MB (24 Mpix) to 432 MB (36 Mpix).
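The cache sizes quoted above follow from pixels × channels × 4 bytes. A small sketch of that arithmetic, with a hypothetical helper name and decimal megabytes (1 MB = 10^6 bytes):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes needed to hold an image as 32-bit floats. */
uint64_t cache_bytes(uint64_t pixels, unsigned channels)
{
  assert(channels > 0);
  return pixels * channels * sizeof(float);
}
```

With 3 channels this gives 288,000,000 bytes for 24 Mpix and 432,000,000 bytes for 36 Mpix. If the buffer kept the 4 floats per pixel mentioned earlier in the thread, the same formula would give 384 MB and 576 MB.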

But that might be unnecessary: since the blur estimation and the deblurring are now separated, it is possible to scale both the picture and the blur kernel to run the IOP only on the preview at screen size. And for that, with Python, I’m able to run in under 1 min at HD size.


#68

If it cleans up on exit, I personally will let it eat as much disk space as needed to save time.


(Aurélien Pierre) #69

Hi,

I have been working on and off on that topic for 1 year and 4 months, and I have great news!

I have greatly improved the maths behind it to make the algorithm converge 99.99 % of the time, and a lot faster than before (it needs fewer iterations). Now, it’s truly 100 % auto-adaptive, meaning that it computes different metrics to update its internal parameters (to ensure convergence), hiding a lot of Ph.D.-level stuff (Tikhonov regularization parameter, Cauchy distribution parameter, Sobolev space norm) from the basic user. This is a brand-new algorithm, combining several approaches I have seen in various papers, and it seems to perform very well on a wide range of blurs, even in noisy conditions.

New features

  1. It allows refocusing (to a certain extent) on a specific area without affecting (too much) the other areas. This is especially useful when there are different types of blur in the same picture (motion/focus/gaussian): the algorithm now tries to evaluate the blur in a user-selected area and only corrects the zones where the real blur matches the evaluated one.

  2. It lets you choose the desired behavior: denoise, deblur, or average both. Deconvolution is, by design, aimed at deblurring. In doing so, it adds more noise and amplifies the noise already there. So this algorithm regularizes (= denoises) and deblurs at the same time. The drawback is that the two are opposing phenomena: if you regularize too much, you don’t deblur but denoise instead. So, why not use the drawback of this method to actually denoise without (de)blurring? Since the regularization parameter is optimized and refined automatically inside the solver, taking into account the variance (a metric of the noise amount) and the residual (a metric of the sharpness), we just have to tell the regularization optimizer to favor the variance or the residual, or average both, to adjust the regularization.

  3. It asks for only 3 inputs: the size of the blur (pixels), the size/position of the sample window (to evaluate the blur), and the sharpness/noisiness priority. That’s all. Everything else is estimated inside. Other parameters like the error tolerance are there too, but more as a clutch, to take back the control.

New performance

  1. The optimal internal parameters are now evaluated until convergence, usually in 15 to 30 iterations

  2. Two different metrics are now used to stop the iterations before the solution degenerates :

    1. one ensures the noise created by the deblurring is white (in the signal-processing sense), so no patterns (ringing, fringing) are created. Since white noise looks natural, it’s a fair trade-off to allow some good-looking noise to get some more sharpness. This is done by computing the auto-covariance of the picture, ensuring it decreases monotonically, and stopping the iterations when it rises again by a certain amount. The user can set the tolerance they want on that amount (more tolerance = more sharpness, too much tolerance = :bomb:)
    2. the other ensures the solution is not stagnating, i.e. convergence is reached and it’s useless to continue.
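The two stopping rules above could be sketched like this. It is only an illustration of the logic, not the actual solver: how the auto-covariance is reduced to a single positive number per iteration, the relative-change test used for stagnation, and every name (`stop_state_t`, `check_stop`, `tol`, `eps`) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef enum
{
  KEEP_ITERATING = 0,
  STOP_DEGENERATING, /* auto-covariance rose back above its minimum */
  STOP_STAGNATING    /* residual no longer changes: converged */
} stop_reason_t;

typedef struct
{
  double best_autocov;  /* lowest auto-covariance metric seen so far */
  double prev_residual; /* residual of the previous iteration */
  int started;          /* 0 until the first iteration is recorded */
} stop_state_t;

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Call once per iteration. `autocov` is assumed to be a positive scalar
 * summary of the auto-covariance, `tol` the relative rise tolerated
 * (0.01 = 1 %), `eps` the relative residual change below which the
 * solution is considered stagnant. */
stop_reason_t check_stop(stop_state_t *s, double autocov, double residual,
                         double tol, double eps)
{
  assert(s != NULL && tol >= 0.0 && eps >= 0.0);

  if(!s->started)
  {
    s->best_autocov = autocov;
    s->prev_residual = residual;
    s->started = 1;
    return KEEP_ITERATING;
  }

  stop_reason_t reason = KEEP_ITERATING;
  if(autocov > s->best_autocov * (1.0 + tol))
    reason = STOP_DEGENERATING;
  else if(abs_d(s->prev_residual - residual) <= eps * abs_d(s->prev_residual))
    reason = STOP_STAGNATING;

  if(autocov < s->best_autocov) s->best_autocov = autocov;
  s->prev_residual = residual;
  return reason;
}
```

The solver loop would call `check_stop` after each iteration and break as soon as it returns anything other than `KEEP_ITERATING`. A larger `tol` lets the iterations run longer, which matches the "more tolerance = more sharpness" trade-off described above.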

Results

The code is still a Python/Cython mixture, so it’s better than pure Python but still not as good as pure C. Don’t freak out about the running times. However, compared to what I had 8 months ago, I have seen ×2 up to ×10 improvements, essentially because of the better convergence rate of the algorithm rather than the implementation (maths win). The code is fully parallelized (8 cores), running on a 3.6 GHz Intel Xeon laptop.

Denoise without deblurring : 24 Mpx, 11 min. (original on the left). Auto-covariance tolerance set to 1 %.

That one was processed (not by me) on Adobe Camera RAW from a poorly exposed shot, sharpened but not denoised. That’s a nightmare to correct. Also the colors are different because the original is a JPEG, probably with an ICC profile, and my code outputs TIFF and strips everything that is not a pixel.

Deblur without affecting the background: 16 Mpx, 8 min. (original on the left). Motion blur from the camera and/or the horse of 5 px.

At a more realistic zoom factor :

My face, corrupted with a synthetic gaussian blur and gaussian noise (std = 5). 4 Mpix, 45 sec.


Obviously, on that one, you create some grain because there is already noise.

Enjoy !


(Eylul / Azbulutlu) #70

This is amazing, and the times, at least as a standalone process, really aren’t that much given its capacity for rescuing images. :slight_smile:


(Jean-Paul) #71

Bravo Aurélien, I’m waiting for it to be available in darktable.


#72

Looks very promising to me. Hope this will become a new module in the future!