vkdt devel diary

because it’s been a while, i just wanted to give a brief update on what’s happened in vkdt. it’s not much but may be of interest to one person or another. all of this is, as usual, super rough. i’ll post some better example images once i stabilise some of these features.

local contrast refinements

i’ve experimented a bit with doing local laplacian pyramid style contrast enhancement in log space. i think the results are a lot more pleasing, especially near the blacks (doing it in linear space drowns some values, leading to clipped blacks). will post examples, but for now i became greedy and want to fix another thing or two with the current curve. maybe i can demonstrate even prettier results soon.

colour module

i implemented a cursory input colour module, which can apply the camera’s white balance coefficients and matrix. since this might shoot colours out of gamut, there is a radial basis function (RBF) layer right after it, which is used for gamut mapping (either to rec709 or whatever colour space you want, or to the full spectral locus approximated by a polygon):


in this graph, the spectral locus (green) is approximated by a polygon with few vertices (purple), and the XYZ gamut (light blue) is projected inside it (yellowish). the somewhat unusual shape is because i’m mostly running in rec2020, so the gamut is considered in rec2020 red/blue space instead of xy chromaticity coordinates. in contrast to dt, the RBF here is 2d, so it’s useful for colour correction, but not for general artistic/colour zones style manipulation (in particular you can’t condition changes on brightness).

while implementing this i noticed that the RBF kernel used by darktable-chart produced really bad results. the kernel matrix has quite a few 0 entries (for both distances of 0 and of 1), which made my radial basis functions behave really strangely. because they are fun to look at, here are some shots of my debugging test case:




note that the view in the top right is the vanilla waveform histogram… it just happens to look like a curve widget because of how the colours are arranged in the main image. anyway, using dt’s kernel would not keep the nodes at (0,0) and (1,1) fixed in this view. i suspect darktable-chart might improve its matching performance with a different kernel. will try at some point.
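for reference, here’s a minimal 1d sketch of what i mean by nodes staying fixed. with a kernel that is nonzero at distance 0 (a gaussian here, purely as an illustration — not dt’s kernel and not my actual 2d code), solving for exact interpolation pins every node, including (0,0) and (1,1):

```python
import numpy as np

def rbf_fit(xs, ys, kernel):
    # solve for weights so the interpolant passes exactly through the nodes
    A = kernel(np.abs(xs[:, None] - xs[None, :]))
    return np.linalg.solve(A, ys)

def rbf_eval(x, xs, w, kernel):
    # evaluate the interpolant at x as a weighted sum of kernel values
    return kernel(np.abs(x - xs)) @ w

# gaussian kernel: nonzero at distance 0, so the nodes stay fixed
gauss = lambda r: np.exp(-(r / 0.5) ** 2)

xs = np.array([0.0, 0.3, 0.7, 1.0])
ys = np.array([0.0, 0.4, 0.8, 1.0])
w = rbf_fit(xs, ys, gauss)
print(rbf_eval(0.0, xs, w, gauss))  # ~0.0: endpoint pinned
print(rbf_eval(1.0, xs, w, gauss))  # ~1.0: endpoint pinned
```

a kernel that evaluates to 0 at distance 0 can’t do this — the matrix loses the diagonal that ties each node to itself.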

of course i want to use this to correct whatever mess the matrix made and pull the source colours to the correct ones whenever i know the correct values (say, IT8/darktable-chart kind of thing). for this i need to pick colours on images.

colour picker

this is one nightmare in a cached environment such as darktable’s processing pipeline. in vkdt i don’t have a pipeline, but a graph (DAG). so what i do is attach a picker module anywhere i like:


this makes sure the data is always up to date. well, provided you pass the “upload sink to host” runflag to the graph. usually i’m reluctant to copy data from GPU to host, because slow.
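conceptually it works like this (a toy sketch, nothing like the actual vkdt api): the picker is just another sink node in the DAG, so it gets re-evaluated along with everything else whenever the graph runs:

```python
# hypothetical module names, for illustration only
graph = {
    "i-raw":    [],
    "demosaic": ["i-raw"],
    "colour":   ["demosaic"],
    "display":  ["colour"],
    "pick":     ["colour"],   # attach the picker wherever you like
}

def schedule(graph):
    # topological order via depth-first search:
    # every node runs after all of its inputs
    order, seen = [], set()
    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for dep in graph[n]:
            visit(dep)
        order.append(n)
    for n in graph:
        visit(n)
    return order

print(schedule(graph))
```

since the picker depends on the node it’s attached to, it can never observe stale data — no cache invalidation logic needed.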

for the picking itself, i’m autogenerating a gui (remember the gui is strictly separated from the core, the module just comes with a few textual annotations of what parameters there are). this means it’s not smooth to use (to say the least) and ugly and redundant and wastes space. will fix with low priority at some point. that said, you can pick a variable number of spots:


these colour buttons are standard imgui arsenal; they show a tooltip with numeric values if you hover over them. (i chose this synthetic image because it has fluorescent colours stored in XYZ colour space.)

so much for now, hope to find time to polish some of this to make it actually useful.


I am super interested in this, and I have been doing a lot of work on local laplacian pyramids over the past weeks. You might be interested in having a look at my C++ code: https://github.com/aferrero2707/PhfMerge/blob/master/src/phf_llf.cc
It is ugly, slow and badly commented, but the results are IMHO very nice.
It also works in log space, as recommended by Paris in his original paper… I am planning to write a post describing the code a bit and giving some examples, but feel free to ask me questions if you are interested. I’ll try to compare your code and mine and see if/how they differ, but I am afraid you’ll be faster at reading my C++ code than I will be at interpreting your GPU implementation, and I am a complete newbie at GPU programming…

In the past I managed to compile and run vkdt, so it is definitely time to try again!

I hope you don’t mind if I try to give my explanation of why log is better… so here it goes:

The algorithm splits the scene into “edges” and “texture” according to a threshold parameter. When compressing the dynamic range, edges are reduced in amplitude but the local gradient is preserved (if I understand correctly). Any fluctuation smaller than the threshold is considered “texture” and preserved during the dynamic range compression (or amplified when doing local contrast enhancement).
The key point is: what is this threshold parameter? If you work with linear data, the threshold corresponds to a difference in pixel values. However, in dark areas the pixel differences are typically smaller than in bright ones (think of two color checkers illuminated at different light intensities), hence the threshold is applied unevenly.
In log space, differences are equivalent to linear ratios. Hence, a threshold applied in log space selects edges according to ratios of pixel values, not differences. Taking again the color checker example, this means that the transition between two patches would be considered either an “edge” or a “texture” independently of the absolute brightness.
This results in a homogeneous effect over the whole intensity range…
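A tiny numeric sketch of this point (hypothetical patch values, not taken from phf_llf): the same 2:1 patch transition passes a log-space threshold at both brightness levels, but passes a linear threshold only in the bright case:

```python
import math

def is_edge_linear(a, b, t):
    # linear-space threshold: absolute pixel difference
    return abs(a - b) > t

def is_edge_log(a, b, t):
    # log-space threshold: equivalent to a ratio of pixel values
    return abs(math.log(a) - math.log(b)) > t

# two patches of a checker, same scene lit at 100% and at 10%
bright = (0.8, 0.4)
dark = (0.08, 0.04)
t_lin, t_log = 0.2, math.log(1.5)

print(is_edge_linear(*bright, t_lin), is_edge_linear(*dark, t_lin))  # True False
print(is_edge_log(*bright, t_log), is_edge_log(*dark, t_log))        # True True
```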

ah thanks for the link! will definitely have a look. i think i recognise some of the comments, so it might indeed be doable to understand it :slight_smile:

interesting explanation wrt log space. i don’t quite remember the paper in much detail… my impression back then was that the tonemapping application first applies a log shaper and then the local contrast. i’m now going log space -> apply contrast -> go back to linear. my simple observation was that a dark value might be pushed below zero when pushed away from a local average that is slightly brighter. this results in quite harsh transitions; dark areas drown early on. in log space, log(zero) is very far away indeed ™. so pushing dark values will smoothly transition them towards zero, but never let them cross zero.
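to illustrate (a toy sketch, not the actual vkdt kernel code):

```python
import math

def push_contrast_linear(v, ref, amount):
    # push a value away from a local reference in linear space
    return v + amount * (v - ref)

def push_contrast_log(v, ref, amount, eps=1e-6):
    # same push, but in log space: the result can approach
    # zero but never cross it
    lv, lr = math.log(v + eps), math.log(ref + eps)
    return math.exp(lv + amount * (lv - lr)) - eps

v, ref = 0.02, 0.10   # dark pixel, slightly brighter local average
print(push_contrast_linear(v, ref, 2.0))  # negative -> clipped black
print(push_contrast_log(v, ref, 2.0))     # small but still positive
```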

that said my contrast curve is still not optimal because it introduces gradient reversals for large contrast values. i’m hoping i can find one that doesn’t cause banding because of high frequency content and also doesn’t reverse gradients.

nice, i managed to build your utility (after adding #include &lt;cassert&gt;), will do some comparisons. here are two of my curves:

[image by william clark, i think i got it here https://retouchingacademylab.com/free-raw-files-for-retouching-practice/]


top: old version with harshly clipped shadows (see eye and neck shadow boundary).
bottom: log version with smoothly fading out shadows. okay, overall the look is a bit goth maybe. but i’m more worried that pretty much all other controls (highlights + contrast) don’t work so well in this version. i really like the soft shadow darkening though. still work to do.

curious to see your result. your utility takes linear tif as input? unfortunately running it gives me a heap buffer underflow error from address sanitizer before it outputs a result.

Cool! Here is what I get on the same image (but not the same raw processing):

Input:


RA_William_Clark-2-small.tif (8.0 MB)

Local contrast enhanced (same remapping function for all scales):

phf_llf -c 1 -C 2 2 -t 0.5 -o RA_William_Clark-2-llf-lc2.tif RA_William_Clark-2-small.tif

Local contrast enhanced (with decreasing strength from the coarsest to the finest scale):

phf_llf -c 1 -C 5 1 -t 0.5 -o RA_William_Clark-2-llf-lc.tif RA_William_Clark-2-small.tif

Both are intentionally exaggerated, to show the effect of applying a different local contrast strength at different scales…

It requires (and currently assumes) the input image to be a 32-bit floating-point TIFF in linear sRGB. Probably an EXR would work as well, but I have not tried yet…

aha! that works, thanks. let’s say it takes “a tad” longer (six minutes in, it’s still at processing level 5/12; vkdt: 17ms for the whole pipeline including demosaicing etc.). but this seems workable, will try to get similar shadow experiments going!

As always, thanks for sharing and for the pretty pictures and figures.


You are the fast gui, I’m the old one :wink:

haha, i don’t think this is true but thanks :slight_smile:

Let’s keep away from the other stuff in pipeline, just talk about demosaicing:

What’s the size of your input and output for demosaicing?
Did you also count the time to transfer data from main memory to GPU memory?
What demosaic algo did you use?

some timings:

[perf] demosaic_down:	   0.113 ms
[perf] demosaic_gauss:	   0.129 ms
[perf] demosaic_splat:	   0.954 ms
[perf] demosaic_fix:	   1.585 ms

this last step is like the median-filter colour smoothing iterations often done after demosaicing, so it’s optional.

Image Width                     : 6034
Image Height                    : 4028

the algorithm is something similar to the super resolution gaussian splatting in the google paper, but i’m not very good at implementing stuff 1:1. i like it because it has the potential to be extended to image stacking/super resolution, and it also works for xtrans with very minor changes (i run the same code). but you know i’m no expert in demosaicing and my eyes are too bad to tell the pixel level differences.

input size = output size, not sure what you mean there? just the output has 3 channels per pixel.

i’m not transferring any memory between CPU and GPU, that would be stupid (this is the whole point of doing a full GPU pipeline). the raw is uploaded once and then the pixels never leave the device (other than to the monitor, or once you export to file).

fwiw the hard drive loading, decoding and transfer are, i believe, summarised here:

[rawspeed] load /home/jo/Pictures/RA_William_Clark-2.dng in 375ms
[perf] i-raw_main:	   3.718 ms

in rawspeed’s defense, the RAF files from my fuji load in about 35ms (or 9ms once the disk caches are warm).

i’m no GPU expert, and i didn’t spend much time on the demosaicing in particular. i’m sure clever or capable people could get more out of it (both in terms of quality and speed).


(just noticed that out of these 17ms i spend 6ms on the CPU preparing the pipeline, which should probably be optimised a lot, as well as 2ms computing a histogram. man, these things are expensive!)

I thought about binning, as iirc you did that in the past, which would reduce the output size compared to the input size by a factor of 4

How do you get the raw data from the file into GPU memory without that?

well yes, obviously i transfer once, but not every time. i’m running the full pipeline every time because i’m too lazy to implement some sort of graph cut, but i do cut after the upload :slight_smile:

that would be the ~3ms quoted above, spent in the input module i-raw.

How does that work out with really large (>= 100 MP) raw files? What amount of GPU memory is needed for those?

hm, didn’t we talk about this before? what happened? you sent me an example file and i forgot to get back to you? my memory is bad, sorry.

i can print the memory stats for this 24mp image when i’m back at that pc. 100mp is going to be roughly 4x that. for my part, i’m fine with people with crazy photo cameras requiring crazy processing hardware too…

right. i did clean this up. you can still force it with the LOD slider and i do it for thumbnails, but by default i run full res now.

so for this pipeline with 24MP input:

[mem] images : peak rss 928.516 MB vmsize 1157.11 MB
[mem] staging: peak rss 235.264 MB vmsize 235.264 MB

these are different types of memory; the staging one is visible to the host (for up/downloading). still, you’ll need to sum the numbers. so i guess 1.4G x 4 would take you to around 6G required for the 100MP images. not unreasonable to ask from a high-end GPU these days, i think.
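just to spell out the scaling arithmetic (a rough back-of-envelope estimate, assuming memory grows linearly with pixel count):

```python
# vmsize numbers from the 24MP pipeline above: images + staging
total_mb = 1157.11 + 235.264

# scale linearly from 24MP to 100MP, convert to GB
estimate_gb = total_mb * (100 / 24) / 1024
print(round(estimate_gb, 1))  # ~5.7 GB
```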

i’m happy to try running on any image if you have one that’s supported by rawspeed and doesn’t require fancy extra treatment.

speed wise i would expect it to be near linear in the number of pixels, potentially better than linear (the graph often has to wait for small kernels to finish which don’t occupy all units of the GPU). then again i’m mostly bandwidth bound, so more pixels means slower. that said, even with a 4x speed impact starting from the current baseline, i still won’t bother implementing sub-region support.

okay, i think i have something workable for the local laplacian pyramids now. i used another image from the same source as above; i like it for the awkward pose, which looks even better when using crazy extreme values in an attempt to break the algorithm.

neutral:

reduced clarity:

pushed clarity to extreme:

shadows 0 highlights 0:
(see how some detail in the specular highlight on the nose is recovered. also some minor shadow lift can be observed)

reduced clarity and a lot of shadows:
(this time it doesn’t drown in clipped black, but black softly takes over)

also, i tested a high dynamic range landscape shot (it was a playraw on this forum):

straight:


with log curve:

with log curve and reduced highlights and shadows, and increased clarity (may have gone overboard with the settings again for demo, clouds look fake):

i think i’m getting closer here, but still don’t want to ship this as a final thing. there’s still the question of colour reconstruction from the adjusted brightness values. there are many options and i’m not yet sure there is a one-size-fits-all solution (apply the curve to each channel, scale rgb according to yo/yi, scale based on some saturation value estimated from an estimated contrast value… and if you do that, how do you estimate the contrast for something that isn’t even a curve, …). in particular no single answer seems to cover both of these two images (a well-lit portrait strictly in [0,1] and an hdr that goes [0,large] if you expose for some mid tones).
