Possibly a new deconvolution module for Darktable

Some updates here. The horse picture above took me 1 h to run 100 iterations on a 9 px blur.

I have found an algorithmic way to accelerate the convergence of the algorithm, and I have done some more Cython optimizations. Now I run 705 iterations in 22 min with a 7 px blur for the same result.

Plus, I have changed my algorithm so that the blur is computed in a step separate from the picture, meaning that the PSF can be stored and saved for later, with 75% of the job already done.

The work continues on the darktable version, with renewed hope given these figures.


Finally got back to looking at the code. There look to be a lot of easy speedups possible (approx. 10% improvement on the OpenMP nested for loops). Add a “collapse(2)” to the OpenMP #pragma lines to enable parallel processing on the second nested for loop. The DT gurus may have a reason not to do that; I’m just looking at it from a pure loop-performance standpoint.

Change this:
#pragma omp parallel for default(none) schedule(static)

to this:
#pragma omp parallel for default(none) schedule(static) collapse(2)

where you have nested “for” loops.
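For example, a minimal sketch of the fused version (the buffers and the operation are made up for illustration, not actual DT code). Note that collapse(2) requires the two loops to be perfectly nested, with nothing between the two for headers, and that with default(none) every variable used inside must be listed in a clause:

#include <stddef.h>

// double every pixel value, with both loops collapsed into one iteration space
static void example_op(float *out, const float *in, int width, int height)
{
#pragma omp parallel for default(none) schedule(static) collapse(2) shared(in, out, width, height)
  for(int j = 0; j < height; j++)
  {
    for(int i = 0; i < width; i++)
    {
      // threads now share width * height iterations, not just `height` rows
      out[(size_t)j * width + i] = 2.0f * in[(size_t)j * width + i];
    }
  }
}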

Also, I’m not yet familiar with the details, but a lot of loops operate on 4 channels if image->c == 4. I wonder if the algorithm needs that or … should the max channels be something like MIN(3, image->c)?

Great work so far!

Thanks for your insight! For the record, I’m not the author of the C implementation, so I’m not familiar with all the low-level details.

The algo doesn’t need the 4 channels; however, it seems that to take advantage of the SSE instructions you need to pass vectors of 4 floats (from what I understood, it makes better use of the CPU L1 cache for systematic operations).
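To illustrate what I mean (just a sketch of my understanding, not the actual darktable code): one RGBA pixel of 4 packed 32-bit floats maps onto a single 128-bit SSE register, so one instruction processes the whole pixel:

#include <xmmintrin.h> // SSE intrinsics

// multiply one RGBA pixel (4 packed floats) by a scalar in a single instruction
static void scale_pixel(float *pixel, float k)
{
  const __m128 v = _mm_loadu_ps(pixel);                // load 4 floats at once
  _mm_storeu_ps(pixel, _mm_mul_ps(v, _mm_set1_ps(k))); // multiply, store back
}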

I’m just at the beginning of my journey with C. For now, the “higher level” reference implementation is the Cython code, which shows the algorithmic structure of what I do more clearly, with a mixed C/Python scheme and all the maths references and papers: Image-Cases-Studies/deconvolution.pyx at master · aurelienpierre/Image-Cases-Studies · GitHub, and the API: https://github.com/aurelienpierre/Image-Cases-Studies/blob/master/richardson_lucy_deconvolution.py

You are doing well given no previous C experience; you are clearly understanding a lot.

Here’s a thought about coding specifically for the SSE instructions.
Since a lot of those loops are simple operations on array elements, here’s what I would do:

Make another image structure member, c_used; then, when creating the image data, add

im->c_used = MIN(3, im->c);

In loops, switch the inner and outer loops so the “for(c = …)” becomes the outer loop. The compiler should automagically do the SSE stuff for you on the inner loop, though I can’t put my finger on a speed comparison of hand-coded SSE instructions vs compiler-generated SSE instructions.

for(c = 0; c < im->c_used; c++)
{
  for(i = 0; i < im->size; i++)
  {
    // example element-wise operation on pixel i, channel c;
    // the index assumes interleaved storage (i * im->c + c), not i + c
    im->data[i * im->c + c] *= k;
  }
}

If it turns out that the SSE instructions need to be hand-coded, you are set up for the next step of manually unrolling the inner loop (with another check to ensure groups of 4 are unrolled without exceeding im->size).
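For example, a minimal sketch of that unrolling (the buffer and the scale factor are made up for illustration):

#include <stddef.h>

// scale `size` floats by k, unrolled in groups of 4, with a remainder loop
// so we never read or write past `size`
static void scale_buffer(float *buf, size_t size, float k)
{
  size_t i = 0;
  for(; i + 4 <= size; i += 4)
  {
    buf[i]     *= k;
    buf[i + 1] *= k;
    buf[i + 2] *= k;
    buf[i + 3] *= k;
  }
  for(; i < size; i++) // 0 to 3 leftover elements
    buf[i] *= k;
}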


Hi @houz @hanatos @LebedevRI @Pascal_Obry,

I’m still working on this. I now have 3 flavours of the algorithm, from fast and nasty to slow and clean, and I wonder: would it be a pain in the neck to cache the output of a module into a temp file on the disk? Namely, the deconvolution by machine learning is performed in 2 steps (see the sketch below):
1. estimation of the blur (and a sharp image) over a sample of the picture (255×255 px or so). The parameters are then stored (a square array of floats, between 3×3 and 33×33). This is the most time-consuming part (4 FFTs + discrete gradient per iteration).
2. deblurring (regularized deconvolution) of either the resized preview (in the darkroom) or the full picture (then, possibly disk-cached). This is just 2 FFTs + discrete gradient but, still, it doesn’t need to be recomputed every time the darkroom preview is scrolled or zoomed.
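To make the split concrete, here is roughly how I picture the flow (every function name below is made up for illustration; none of this is an existing dt API):

// hypothetical signatures, for illustration only
float *estimate_psf(const float *img, int x, int y, int sample_size, int kernel_size);
void store_psf(int img_id, const float *psf, int kernel_size);
void deconvolve(float *img, const float *psf, int kernel_size);

void process(float *img, int img_id)
{
  // step 1: expensive (4 FFTs + discrete gradient per iteration), run once
  // on a small sample upon user request; the PSF is tiny and easy to store
  float *psf = estimate_psf(img, 0, 0, 255, 9);
  store_psf(img_id, psf, 9);

  // step 2: cheaper (2 FFTs + discrete gradient), re-run on the preview or
  // the full picture; this output is what I would like to disk-cache
  deconvolve(img, psf, 9);
}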

What do you think?


I’m no expert on this part, but how is this supposed to work in the pixelpipe? I mean, if some iop before the deconvolution is changed, then the cache is invalidated and the deconvolution recomputed, right? How “fast” and “slow” are the implementations you’re talking about? How much data would be cached? If some hundreds of megabytes, maybe cache it in memory? And activate the module only if the computer has more than a given amount of memory?

How fast? Well, it depends on:

  1. the size of the blur to remove
  2. the size of the picture
  3. the parameters used, because the program automatically stops once convergence is reached (that’s new) to avoid looping for nothing, and the convergence speed depends on the settings

For now, let’s say 2 to 15 minutes for the whole process. I plan to run the initial blur estimation only upon user request (and store it in the database afterwards). The data to be cached would be the RGB values of the whole image as 32-bit floats, so by today’s standards it would be 288 MB for 24 Mpix (24 Mpix × 3 channels × 4 bytes) up to 432 MB for 36 Mpix.

But that might be unnecessary: since the blur estimation and the deblurring are now separated, it is now possible to scale both the picture and the blur kernel to run the IOP only on the preview at screen size. And for that, with Python, I’m able to run in under 1 min at HD size.
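For instance, a tiny sketch of scaling the kernel along with the preview (my own illustration, not actual dt code):

#include <math.h>

// hypothetical helper: shrink the estimated PSF size by the preview's zoom
// factor, keeping it odd so the kernel stays centered on a pixel
static int scaled_kernel_size(int kernel_size, float scale)
{
  int s = (int)roundf(kernel_size * scale);
  if(s % 2 == 0) s += 1; // keep the kernel odd
  return s < 3 ? 3 : s;  // never below the minimal 3x3
}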


If it cleans up on exit, I personally will let it eat as much disk space as needed to save time.

Hi,

it’s been 1 year and 4 months that I have been working on and off on this topic, and I have great news!

I have greatly improved the maths behind it to make the algorithm converge 99.99% of the time, and a lot faster than before (it needs fewer iterations). Now it’s truly 100% auto-adaptive, meaning that it computes different metrics to update its internal parameters (to ensure convergence), hiding a lot of Ph.D-level stuff (Tikhonov regularization parameter, Cauchy distribution parameter, Sobolev space norm) from the basic user. This is a brand-new algorithm, combining several approaches I have seen in various papers, and it seems to perform very well on a wide range of blurs, even in noisy conditions.
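For the curious, the generic shape of the problem being solved (my loose paraphrase; the actual functional combines the ingredients named above) is a regularized inverse problem:

$$\min_{u,\,k} \; \| u \ast k - i \|_2^2 + \lambda \, R(u)$$

where $i$ is the blurry picture, $u$ the sharp estimate, $k$ the PSF, $R(u)$ a regularization term (Tikhonov-like, measured in a Sobolev-type norm), and $\lambda$ the regularization parameter the solver now tunes by itself.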

New features

  1. It allows refocusing (to a certain extent) on a specific area without affecting (too much) the other areas: especially useful when there are different types of blur in the same picture (motion/focus/gaussian). The algorithm now tries to evaluate the blur in a user-input area and only corrects the zones where the real blur matches the evaluated one.

  2. It allows choosing the desired behaviour between denoise, deblur, or an average of both (see the toy sketch after this list). The deconvolution is, by design, aimed at deblurring. Doing so, it adds more noise and amplifies the noise already there. So this algorithm regularizes (= denoises) and deblurs at the same time. The drawback is that both are inverse phenomena: if you regularize too much, you don’t deblur, you denoise instead. So, why not use the drawback of this method to actually denoise without (de)blurring? Since the regularization parameter is optimized and refined automatically inside the solver, to take account of the variance (a metric of the noise amount) and the residual (a metric of the sharpness), we just have to tell the regularization optimizer to favour the variance or the residual, or average both, to adjust the regularization.

  3. It asks for only 3 inputs: the size of the blur (pixels), the size/position of the sample window (to evaluate the blur), and the sharpness/noisiness priority. That’s all. Everything else is estimated internally. Other parameters like the error tolerance are there too, but more as a clutch, to take back control.
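As a toy illustration of point 2 (the blend below is my own simplification, not the exact update rule):

// priority = 0 favours the residual (deblur), priority = 1 favours the
// variance (denoise), 0.5 averages both; the result drives the Tikhonov
// regularization parameter
static float regularization_target(float variance, float residual, float priority)
{
  return priority * variance + (1.0f - priority) * residual;
}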

New performance

  1. The optimal internal parameters are now evaluated until convergence, usually in 15–30 iterations

  2. Two different metrics are now used to stop the iterations before the solution degenerates (a toy sketch follows below):

    1. one ensures the noise created by the deblurring is white (in the signal-processing sense), so no patterns (ringing, fringing) are created. Since white noise looks natural, it’s a fair trade-off to allow some good-looking noise to get some more sharpness. This is done by computing the auto-covariance of the picture, ensuring it decreases monotonically, and stopping the iterations when it increases back by a certain amount. The user can set the tolerance they want on that amount (more tolerance = more sharpness, too much tolerance = :bomb:)
    2. the other ensures the solution is not stagnating, i.e. convergence is reached and it’s useless to continue.
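In toy form, the two stopping tests could look like this (a sketch under my reading of the description above, not the actual code):

#include <math.h>

// stop when the auto-covariance climbs back above its running minimum by more
// than the user tolerance (the noise is no longer white), or when the iterate
// stagnates (convergence reached, further iterations are useless)
static int should_stop(float autocov, float *autocov_min,
                       float prev_norm, float norm,
                       float tolerance, float epsilon)
{
  if(autocov < *autocov_min) *autocov_min = autocov;        // still decreasing: fine
  if(autocov > *autocov_min * (1.0f + tolerance)) return 1; // patterns are appearing
  if(fabsf(norm - prev_norm) < epsilon * fabsf(prev_norm)) return 1; // stagnation
  return 0;
}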

Results

The code is still a Python/Cython mixture, so it’s better than pure Python but still not as good as pure C. Don’t freak out at the running times. However, compared to what I had 8 months ago, I have seen ×2 up to ×10 improvements, essentially because of the better convergence rate of the algorithm rather than the implementation (maths win). The code is fully parallelized (8 cores), running on a 3.6 GHz Intel Xeon laptop.

Denoise without deblurring: 24 Mpx, 11 min (original on the left). Auto-covariance tolerance set to 1%.

That one was processed (not by me) in Adobe Camera Raw from a poorly exposed shot, sharpened but not denoised. That’s a nightmare to correct. Also, the colors are different because the original is a JPEG, probably with an ICC profile, and my code outputs TIFF and strips everything that is not a pixel.

Deblur without affecting the background: 16 Mpx, 8 min (original on the left). Motion blur of 5 px from the camera and/or the horse.

At a more realistic zoom factor:

My face, corrupted with a synthetic Gaussian blur and Gaussian noise (std = 5). 4 Mpix, 45 sec.


Obviously, on that one, you create some grain because there is already noise.

Enjoy !


This is amazing, and the time, at least as a standalone process, really isn’t that much given its capacity for rescuing images. :slight_smile:


Bravo Aurélien, I’m waiting for it to be available in darktable.


Looks very promising to me. Hope this will become a new module in the future!


Hey Aurélien,

what happened to that module? :slight_smile:

Hi Andreas!

One year ago, Edgardo (the dev behind the retouch module) made a prototype of a dt module, based on my Python code, that is functional but too slow for real-life use (not his fault, that’s the algo).

So I tried to adjust the maths to make it converge faster. It does converge better in some cases, but blows up in some others. I’m still working on it to make it more robust. I have been in touch with a German researcher to improve the convergence (another German: I don’t know what it is with you Germans and image processing :smile:).

Now I’m limited by Python (super slow and no real multithreading), so I’m looking forward to continuing my work in full C. In the grand scheme of things, my first 2 modules now merged into master were only a way to get my hands into the darktable source code with easier projects. With the knowledge I have acquired, I’m ready to continue my work on the deconvolution.


Thank you for your dedication! :clap::clap:


Hi!

That is some truly amazing work. I’ve been comparing results between DT and LR for some pictures, and noise reduction is one of the areas where DT sometimes falls short and produces blotchy images compared with the sharp images LR can make. (Sometimes it’s fine too, of course!)

Your work seems really promising, but the performance issues seem to be a major blocker. Do I understand correctly that the DT module runs in 10 minutes? I understand it’s a huge improvement compared to the research material (which counts in hours), but it seems to me this couldn’t be used as a basic module that we would enable lightly, unless performance significantly improves.

Is there a target performance that could make this usable as a normal DT module? Say sub-minute rendering?

Thanks again for all your work!


Yes, basically: as with many great image processing algorithms, real-life performance is the main barrier between promising research papers and a general-use implementation.

My latest work on this has been to accelerate the numerical convergence, and this has turned into a research project rather than the simple “paper to code” project it used to be. Very time-consuming…


Wondering if this module would be OpenCL-friendly. If it is, a 10 min run without GPU assistance would perhaps mean the sub-minute range once OpenCL + CPU multithreading run in parallel, with a half-decent GPU. I’m seeing a 600% overall improvement with some modules like profiled denoise on an AMD RX 560.

Of course, OpenCL would do great here, although an FFT implementation in OpenCL is something I have never done.

@anon41087856 I’m impressed by this research and the attempt to make it a workable darktable module. The results you’re showing are very pleasant to my eye, especially your portrait.

I hope that in the near future you, or someone else, will succeed in this work. I have a lot of slightly blurred photos of my children, and it would be a nice feature to have.
Thank you!