vkdt devel diary

I’d love to implement Filmulator in vkdt…


:vulcan_salute:

If only I had a proper GPU…

You can find a Radeon RX 480 with 8 GB on eBay for 50–100 €.

I try to run this from time to time; I even upgraded my OS to give it a chance (something I normally do only when I get a new machine…), but so far no luck :frowning:
The basic test app ‘vkcube’ runs fine, but as soon as I try vkdt I get error messages about some features being unavailable on my Intel card… Oh well.

sounds about right, yes. in all fairness, there are raw formats that take a whole lot longer to read than these particular RAF files. i guess i’m lucky to have one of the faster formats here.

right, why not. when working interactively this disk->cpu mem->gpu mem part is cached so you usually don’t notice it. but more speed is always better…

right, that may actually be more realistic in this framework than in stock dt because of the full-ROI processing. none of the api is finished or stable though.

and you’re saying that after we know the specs for an RTX 3070? but well, 50 EUR is an argument.

hm which intel card is that? it’s not much fun to run on older/mobile intel, but last time i checked it still worked. i don’t think i meant to drop support for these, but some features are really nice to use (read/write image without format for instance). i was a bit too happy that my driver supported vk 1.2 and put that as a requirement in the code at some point, but i reverted that in the meantime.

I think that’s what it was complaining about, I’m afraid…
Anyway, my card should be an Intel HD Graphics 520 (what comes with a ThinkPad T460s), if that helps.

Well, that’s low end today…


Try painting on a 36 Mpix image in Krita, you’ll feel the low end :wink:


…actually the more i think about it the more i like this challenge :slight_smile: you’re running some sort of diffusion on the image + solvent density buffers, right? how many iterations do you usually do there? iterative algorithms with memory barriers in between might still be slow. say you need 100 iterations and we can maybe do 0.5 ms/iteration, that’s still 50 ms total, which would not be on par with the rest.

didn’t you have slides or something that explain the process? or could you maybe remind me of some details again? maybe we could do some of the diffusion iterations multi-scale, i.e. on reduced resolution buffers for increased speed?

i’d hate to do tiling/shared memory tricks, but that may be another way to push the iteration inside one kernel without going to global memory in between. but that would probably require significant overlap between the tiles. how big is your diffusion radius typically?

anyways, let me know if you really want to do this, i’d be excited to see a fast version of your filmulator!


There are only twelve iterations, so the main issue is the diffusion.

The standard deviation of the gaussian blur defaults to 1.3% of the image height, but currently it’s adjustable from 11.8% down to 0.1%. I don’t think I’ve ever used a blur larger than around 4% in actual use, though. Edit: I’m 99% sure that these blurs can be done scaled down, as long as values are conserved. It could even be permanently at a smaller resolution if step 3.2 can have values conserved despite mismatching scales.
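To put those percentages in concrete terms, here is a rough back-of-the-envelope sketch (the 4000-px image height and downscale factor are arbitrary example values, not from this thread):

```python
# Rough arithmetic for the blur sizes mentioned above.
# The image height and downscale factor are hypothetical example values.
height = 4000                    # image height in pixels (example)
sigma_default = 0.013 * height   # default: 1.3% of height -> 52 px
sigma_min     = 0.001 * height   # lower bound: 0.1%       -> 4 px
sigma_max     = 0.118 * height   # upper bound: 11.8%      -> 472 px

# If the diffusion ran on a buffer downscaled by factor s (as suggested
# above), the same physical blur only needs sigma/s pixels there:
s = 4
sigma_small = sigma_default / s  # 13 px at quarter resolution
```

This is why the "do the blurs scaled down" idea is attractive: the big default blur shrinks to an ordinary-sized kernel at reduced resolution.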

The process goes as follows:

  1. Convert raw data to linear RGB (I use sRGB exclusively in Filmulator) with common processing (black subtraction, CA correction, demosaic, capture sharpening, lens correction, white balance, color matrix, exposure compensation, and eventually noise reduction)
  2. Prepare uniform value buffers for “crystal radius” (the size of the silver grains), “developer concentration”, “silver salt concentration”, and then use a basic transfer function with an adjustable toe and highlight rolloff to set up an “active crystals per pixel” buffer.
  3. Loop the following 12x:
    3.1 Increase the crystal radius according to the developer concentration and silver salt concentration
    3.2 Reduce the developer concentration and silver salt concentration according to the volume of crystal generated (calculated from the crystal radius, the change, and the number of crystals per pixel)
    3.3 Blur the developer concentration layer
    3.4 Replenish some of the developer concentration with a reservoir according to an adjustable rate (Drama)
  4. Convert the cross-sectional area of the crystals in each pixel to the output value
  5. Apply a standard tone curve to brighten
  6. Standard user-controllable brightness/contrast/saturation/curve/whatever controls
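Steps 2–4 above can be sketched as a toy 1-D simulation. All constants, the linear transfer function, and the box blur below are made up for illustration; the real process uses a proper toe/rolloff transfer function and 2-D gaussian diffusion on GPU buffers.

```python
# Toy 1-D sketch of the filmulation loop (steps 2-4 above); not vkdt code,
# and all constants are invented for illustration.

def blur1d(buf, radius=1):
    """Crude box blur standing in for the gaussian diffusion in step 3.3."""
    out = []
    for i in range(len(buf)):
        lo, hi = max(0, i - radius), min(len(buf), i + radius + 1)
        out.append(sum(buf[lo:hi]) / (hi - lo))
    return out

def filmulate(exposure, iterations=12, replenish=0.05):
    n = len(exposure)
    radius    = [0.1] * n   # crystal radius
    developer = [1.0] * n   # developer concentration
    salt      = [1.0] * n   # silver salt concentration
    # step 2: toy linear transfer function -> active crystals per pixel
    crystals  = [max(0.0, e) for e in exposure]

    for _ in range(iterations):
        for i in range(n):
            # 3.1 grow crystals according to developer and salt
            growth = 0.1 * developer[i] * salt[i] * crystals[i]
            radius[i] += growth
            # 3.2 deplete developer and salt by the silver volume created
            developer[i] = max(0.0, developer[i] - growth)
            salt[i]      = max(0.0, salt[i] - growth)
        # 3.3 diffuse developer between neighbouring pixels
        developer = blur1d(developer)
        # 3.4 replenish developer from the reservoir ("Drama")
        developer = [d + replenish * (1.0 - d) for d in developer]

    # 4. output value from crystal cross-sectional area
    return [r * r for r in radius]
```

The interesting behaviour lives in the coupling: a bright pixel grows large crystals but depletes its local developer, and the diffusion step lets neighbouring pixels feel that depletion, which is where the film-like local contrast comes from.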

I really am interested in doing this.


great! sounds simple enough. steps 1, 5, 6 sound like standard repertoire. not all of this is implemented or very well implemented… but we can probably take that as given.

so the additional memory you require is one image-sized buffer with 4 channels (radii, developer, salt, active crystals)?

the only local operation (i.e. one that doesn’t depend only on a single pixel) would be the blur in step 3.3? in this case we could probably put everything into one kernel and then hand it to some generic blur block, such as this one: vkdt/api.h at master · hanatos/vkdt · GitHub . this should scale well to larger radii.

not sure how to best do the iteration though. one could explicitly instantiate 12 nodes in a chain (probably the easiest for a first try). the alternative is to use a cycle in the processing graph. this way you could output an animation and watch the image be developed :slight_smile:

does this process converge? by depleting all the crystals/developer? or would more time allow for more diffusion and then a different look? or would discretisation into fewer timesteps mean larger blur (diffusion for longer delta t)?

in general this downsampled blur is not super efficient, because it doesn’t utilise the whole device. especially at the lower resolutions it’s not enough work to keep the GPU cores busy, but we still have to wait for the whole thing to come back before we can continue with the next iteration.

The process isn’t allowed to fully converge; this isn’t like stand development where it depletes all the developer. In the current Filmulator, we sum up all of the developer diffusion in step 3.4 and subtract it from the reservoir. But that seems like it would be terribly inefficient on GPU, so I would just skip it for this.

More time would lead to more depletion, increasing the effect, but we leave that constant, instead changing the rate of replenishment of developer.

Using fewer timesteps has led to instability where bright pixels use up all of the pixel’s developer in one timestep, sending the developer concentration negative. We do clamp to zero just in case, though.
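That overshoot is easy to reproduce in a toy update rule (the numbers and the rule itself are made up, not Filmulator's actual equations):

```python
# Toy illustration of the timestep instability described above.
# Numbers and update rule are hypothetical, not Filmulator's actual math.

def develop_step(developer, demand, dt):
    """Deplete developer proportionally to local demand over timestep dt."""
    return developer - demand * developer * dt

# one big timestep on a bright pixel: concentration overshoots below zero
d_big = develop_step(1.0, demand=3.0, dt=0.5)   # 1.0 - 1.5 = -0.5
d_clamped = max(0.0, d_big)                     # clamp keeps it physical

# the same total time split into 12 small steps stays positive,
# because each step depletes only a fraction of what is left
d_small = 1.0
for _ in range(12):
    d_small = develop_step(d_small, demand=3.0, dt=0.5 / 12)
```

With small steps the depletion is effectively multiplicative (each step removes a fraction of the remaining concentration), so it decays toward zero without ever crossing it; one big explicit step can subtract more than exists.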

…trying to prepare a branch as a playground for you. to minimise memory traffic, i’m keeping things minimal. i’m working on the assumption that the data flow is something like:

init:
image → 4x aux vars (concentration etc)

main:
loop 12x:
aux → adjusted aux → blur

finally:
aux → image

in particular the main kernel and blur only affect the aux buffer, init converts the input image to aux variables, and the final pass converts aux variables to a colour image.

does that sound about right?

Yeah, the loop only operates on the various auxiliary variables.

okay, if you want to have a play, there’s a branch now: GitHub - hanatos/vkdt at filmulator . let me know if you have any questions.

perf is mediocre, daisy chaining all the blurs is just stupid slow. plus, i went for the quick thing and hardcoded 12 iterations in a fixed pipeline, so now even just uploading the uniforms and setting up the pipeline is slow. i may be able to improve that a little but maybe we’ll find a better way to do this.

(now i’ll shut up because discuss tells me my manners are bad and if at all i should write private messages instead.)

???

Weird, it is your own thread. The warning shouldn’t be showing.

@hanatos Can you please quote what discuss told you?

nah, sorry and now it doesn’t do it any more. something like i shouldn’t only reply too many times to the same person because others want to say something too. and that there’s private messages. very patronising for a computer program i thought :slight_smile:


The machines are on to your hijinx. They shan’t be fooled!