Possibly a new deconvolution module for Darktable


(Ingo Weyrich) #41

Exactly. For that reason I mentioned:

1. these are completely different approaches.


Thanks – so it boils down to the precision. It reminds me of the DxO noise reduction algorithm called PRIME – you’ve got to wait for it, but sometimes it’s worth it. I think it’s useful to have a choice between the quick-and-dirty and the slow-and-pretty tools, so keep up the good work.

(Aurélien Pierre) #43

Quick update: I’m now able to perform 220 iterations with a 9 px kernel on the 16 Mpix horse picture above in 1h40, and with far better noise detection over the RGB channels. The former version took more than 1 h for 100 iterations with a 5 px kernel. (No other sharpening/local contrast enhancement used here) :

Now my base routines (convolution product and gradient estimation) are written in plain Cython (Python compiled to C) and multithreaded on 8 cores. Most of the array operations still go through NumPy and the Python interpreter though, so the max CPU usage is around 80 % and the average around 70 % (I think the remaining 20–30 % are lost in I/O and the Python global interpreter lock).

For reference, this is the source image :

(Aurélien Pierre) #44

And this was 4 hours of computations. I don’t know if it is possible to make it better; the blur seems uneven, with lens distortion. Maybe I will try to split the image into 4 to get uniform areas. 870 iterations.

(Aurélien Pierre) #45

@hanatos @houz the code is now inside a DT module (early pre-alpha/crappy code : https://github.com/aurelienpierre/darktable/tree/rlucy), and the 12 Mpx waterfall picture above that took me 2h30 with Python now takes less than 10 min in DT from the RAW (the one that took 4 hours now takes 40 min).

Now, I would like to play with tiles. I know tiles are an option in DT when RAM is lacking, but is there any way to force the use of (let’s say) 9 tiles? This would be an easy way to compute a spatially-variant PSF.

(Pascal Obry) #46

You don’t seem to have OpenMP activated yet. So there are still a lot of CPU cycles to gain. Right?

(Aurélien Pierre) #47

I have OpenMP pragmas written on the for loops but… I’m discovering C + GUI coding + GTK all at once as I’m programming, so I have no precise idea of what I’m doing outside of the maths details. Do you think OpenMP is not running?

(Pascal Obry) #48

Yes, I think so. In your code I see that you’re checking for _OPENMPx, for example:

    #ifdef _OPENMPx
    #pragma omp parallel for default(none) shared(im, mult) schedule(static)

Looks like the code has been disabled on purpose. Not sure why, maybe not ready?

(jo) #49

amazing what difference the language alone makes. seems encouraging.

the code does indeed not look very performance-oriented at all yet :slight_smile:

all the boilerplate image struct + image operations look like they would introduce a fair bit of cache thrashing and unnecessary memory allocations. i mean things like this:

    rlucy_img_mean(image, mean, time);
    rlucy_alloc_image(&img_tmp1, image->w, image->h, image->c);
    rlucy_copy_image(&img_tmp1, image);
    rlucy_img_subtract_1(&img_tmp1, mean, time);
    rlucy_img_norm(&img_tmp1, norm, time);

(also, i disagree with the code comments that seem to indicate things would be better if this kind of api were filled with calls into blas)

re: tiling. i don’t think you want to rely on our tiling engine. the requirements for when buffers are tiled and what happens to the tiles are very different, and relying on it may lead to unexpected behaviour if the pixel pipeline engine changes in the future.

(Aurélien Pierre) #50

the time variables are used for benchmarking during dev. What do you suggest for basic array operations (subtraction, multiplication, etc.)? BLAS was my first thought because it’s the basis of LAPACK and is already threaded and optimized.

Also, the pixelpipe has 4 channels; I expected 3 (RGB). What is the fourth? Do you have some sort of iop API documentation?

(Roman Lebedev) #51

Spell all that out manually.
Fewer calls -> fewer dependencies -> fewer copies -> cleaner code -> more breathing room for the compiler -> performance.

Padding, basically. You can use it in the module, but at the end it needs to be copied.

hivemind in #darktable IRC channel :slight_smile:

(Ger Siemerink) #52

compiling it gives an error

(Ger Siemerink) #53

    common/rlucy_deblur.c:23:10: fatal error: cblas.h: does not exist
    #include <cblas.h>
    compilation terminated.

(Aurélien Pierre) #54

Yeah, you have to install the OpenBLAS lib to compile, but this dependency will be removed per Roman’s recommendation. Or you can just delete line 23 of common/rlucy_deblur.c, because the lib isn’t used anyway.

(Aurélien Pierre) #55

@LebedevRI are you really sure about BLAS though? Because my code is like 75 % matrix multiplications and 25 % FFT convolutions, and both CPU vendors have highly optimized BLAS versions (Intel MKL, AMD ACML) that seem really efficient :

or :


or :

I could really use the ×15 speed-up. Plus it’s less subroutine coding. Plus, there are already OpenCL versions :


(Roman Lebedev) #56

What Jo said. I will be rather surprised if you don’t get at least a 2x speedup just by getting rid of most of that indirection.

(Aurélien Pierre) #57

Of course the general structure of the code will be optimized too and the copies removed; this was just a first draft from Edgardo based on my Python code.

I was thinking of splitting my module in two: since the regularization method in the deconvolution is basically denoising with gradients/total variation, I thought I would code the gradients lib into a denoising module, optimize and test it, and then reuse it in the deconvolution. That would let me work on smaller chunks of code while I’m still figuring out C/GTK/the DT pipe. What do you think?

The reference for the denoising method is here :


(jo) #58

first, i really very much don’t think your code is going to run faster using blas as compared to doing individual optimisations or just starting with a careful implementation. second, depending on blas may hinder merging this.

re: denoising. did you try replacing that by anything we already have in darktable? i’d be reluctant to accept yet another denoising technique, especially since this one seems to come from quite a different processing domain. it seems unclear to me that it performs any better than existing approaches more tailored towards rgb images and the kind of noise prevalent in digital cameras.

(Aurélien Pierre) #59

re: denoising. did you try replacing that by anything we already have in darktable?

Not possible. The reason is mathematical: natural sharp images show sparsity in their gradients, and their log(L1_norm(Total Variation)) follows a Cauchy distribution. So the noise optimization/deconvolution regularization relies on this property (and the associated probabilities, through the Bayesian framework) to remove artifacts, leading to piecewise-smooth areas. This has been shown by Antonin Chambolle (2003), Anat Levin (2007) and Paolo Favaro (2015). Chambolle denoised with TV computed on forward first-order finite differences in 2D (2 neighbours). Favaro did so with centered first-order finite differences (8 neighbours). I modified his work with a 3D spectral-spatial centered second-order difference using the paper above (18 neighbours), and the regularization is far better (from my tests).
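For readers following along, the quantity being discussed is, roughly, the discrete total variation in Chambolle’s forward-difference form (a sketch of the standard definition, not the exact 18-neighbour variant described above):

```latex
% total variation of an image u on forward first-order finite differences
\mathrm{TV}(u) = \sum_{x,y} \sqrt{\left(u_{x+1,y} - u_{x,y}\right)^2
                                + \left(u_{x,y+1} - u_{x,y}\right)^2}
```

The sparsity prior is then placed on this quantity: for sharp natural images it is small almost everywhere and large only at true edges.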

Generally speaking, the total variation is just a way to compute the likelihood of a pixel being noise and to derive a penalty to damp it. Aside from the hyperspectral context, the 3D TV in the paper shown above is just a way to assume that real edges should show the same gradient along the 3 RGB channels, and to use info from the 3 channels to identify more accurately what is detail and what is noise. If a pixel doesn’t have the same gradient over the 3 channels, we assume it’s noise.

Besides, photographically speaking, I just don’t like the output of the profiled denoise :slight_smile:

(Ger Siemerink) #60

thanks Aurélien, will try!