processing that sucks less?

hi

frustrated with heavy dependencies and slow libraries, i’ve been experimenting with some game technology to render raw image pipelines. in particular, i’m using SDL2 and vulkan. to spur some discussion, here is a random collection of bits you may find interesting or not.

also please note this is just a rough prototype bashed together with very little care and lots of hardcoded things just to demonstrate what’s overall possible or not.

in case the video doesn’t play, here’s a still:


(thanks to andreas for the raw, i stole it from play raw here)

brute force processing the full raw (well, half size, because i don’t have any demosaicing), this runs faster than vsync and shows tearing, because in fact there is no vsync and it’s too fast. these are some performance counters from a GTX 1080 (an intel HD 5500 is 100x slower), but i don’t trust the numbers:

[pipe] query demosaic:   0.0031 ms
[pipe] query exposure:    0.002 ms
[pipe] query filmcurv:   0.0031 ms

in any case it seems clear that this is just the time it takes to carry the image through the compute shader pipeline; the gpu is completely unimpressed by the actual compute done.

so far this is implemented as a generic node graph, which can output dot files like this:

(image: the node graph rendered from the dot file)

and every module is defined by a couple of text files, one defining the connectors:

input:read:rgb:f32
output:write:rgb:f32

one defining the module parameters, with annotations for gui generation:

x0:float:1:0.0:0.0:1.0
x1:float:1:0.2:0.0:1.0
x2:float:1:0.8:0.0:1.0
x3:float:1:1.0:0.0:1.0
y0:float:1:0.0:0.0:1.0
y1:float:1:0.2:0.0:1.0
y2:float:1:0.8:0.0:1.0
y3:float:1:1.0:0.0:1.0

and a compute shader, which is then automatically compiled into a vulkan command buffer, one compute pipeline per node. the gui is immediate mode and uses dear imgui for the slider widgets. in fact the image is drawn this way, too, so the output of the compute shaders never leaves the gpu. if you drag a slider, the raw stays on the device and only the rest of the pipeline is executed and the result displayed. added benefit: 30-bit/pixel setups should be straightforward to support.
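
to give an idea what this amounts to: per node, the command buffer records a bind + dispatch + barrier, roughly like this (a sketch against the vulkan C API with made-up parameter passing, not the actual code):

#include <vulkan/vulkan.h>

// record one node: bind its compute pipeline and descriptor set, dispatch
// over the image, then barrier so the next node sees the writes.
// the 8x8 rounding matches a layout(local_size_x=8, local_size_y=8) shader.
void record_node(VkCommandBuffer cmd, VkPipeline pipe, VkPipelineLayout layout,
                 VkDescriptorSet dset, uint32_t wd, uint32_t ht)
{
  vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipe);
  vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE,
      layout, 0, 1, &dset, 0, NULL);
  vkCmdDispatch(cmd, (wd + 7) / 8, (ht + 7) / 8, 1);
  const VkMemoryBarrier barrier = {
    .sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,
    .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
  };
  vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
      VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, 0, 1, &barrier, 0, NULL, 0, NULL);
}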

the setup we’ve been looking at above comes from this config file:

module:rawinput:01
module:demosaic:01
module:exposure:01
module:filmcurv:01
module:display:01
connect:rawinput:01:output:demosaic:01:input
connect:demosaic:01:output:exposure:01:input
connect:exposure:01:output:filmcurv:01:input
connect:filmcurv:01:output:display:01:input
param:exposure:01:exposure:2.0
param:filmcurv:01:y2:0.8

in the video i’m using a fake demosaic module, exposure, and a fake filmic curve remotely similar to what aurelien has done for dt.
to give you an idea how advanced (or not) this is, here’s a screenshot from debugging the parametric curve (a monotone hermite spline) with python:

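the core of such a curve is tiny, by the way. a sketch of one segment in C (not the actual shader code), with the fritsch-carlson-style tangent clamp that keeps the spline monotone:

#include <math.h>

// evaluate one cubic hermite segment on [x0,x1] with endpoint tangents m0,m1
static float hermite(float x, float x0, float x1,
                     float y0, float y1, float m0, float m1)
{
  const float h = x1 - x0;
  const float t = (x - x0) / h; // assumes x0 <= x <= x1
  const float t2 = t*t, t3 = t2*t;
  return ( 2.0f*t3 - 3.0f*t2 + 1.0f) * y0 + (t3 - 2.0f*t2 + t) * h * m0
       + (-2.0f*t3 + 3.0f*t2       ) * y1 + (t3 -      t2   ) * h * m1;
}

// tangent from the two neighbouring secant slopes s0, s1, clamped such
// that the interpolant cannot overshoot (a sufficient condition for
// monotonicity)
static float tangent(float s0, float s1)
{
  if (s0 * s1 <= 0.0f) return 0.0f; // secant slope changes sign: flatten
  const float m = 0.5f*(s0 + s1);   // central difference
  return copysignf(fminf(fabsf(m), 3.0f*fminf(fabsf(s0), fabsf(s1))), m);
}
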
i really like the performance i can get out of this, and i also like how there’s only one code path (glsl shaders) as opposed to three (i386, sse, opencl). it seems these 2D image processing things map extremely well to GPU shaders, even on my 5yo intel laptop. i mean this in contrast to how well our opencl code path would have worked on this device. i’m also quite happy to get rid of a ton of dependencies on the way.

let me know your thoughts, i’d be interested in anything related to new and faster pipeline/ui.

i’d attach my fake-filmic glsl code just so you could get an impression what it would be like to write an iop in such a framework, but it seems shader code is not among the allowed file types. let me know if you’re interested and i’ll paste it somewhere.


That looks really interesting and impressive! I would definitely be interested in your sample code.

From your sketch of the node graph I get the idea that you can already inter-connect tools in arbitrary order. Is that the case? How generic is the node graph? How difficult would it be to introduce grayscale opacity masks?

If I can get a GPU alternative to the CPU-based VIPS pipeline, then I am ready to put in lots of effort to help develop it!


nice, thanks for the offer! to be useful, this would require a fair bit of work indeed.

yes, nodes can be connected in any order. there are a few conditions that have to be met for an output pin to fit an input pin (mainly matching pixel format).

grayscale opacity masks should be easy. if you read the mask from file, you add another input node and create a mask node with two inputs (colour image + mask). as it turns out, my cheaper GPU only supports two- and four-channel images, so i’m carrying RGBA along all the time anyways, which leaves room for a mask channel (similar to darktable). of course you could also create the mask from drawn input or programmatically, like the parametric blending in dt. i did not implement anything in this direction, but the processing graph supports it.
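
in config-file terms, a file-based mask could look something like this (hypothetical module names, none of this exists yet):

module:maskinput:01
module:blend:01
connect:demosaic:01:output:blend:01:input
connect:maskinput:01:output:blend:01:mask
connect:blend:01:output:display:01:input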

at LGM we talked about ROI + context buffers, or allocating padded input. ROI rendering and tiling for very large buffers are planned features, but currently there is no code to support them (mainly comments and some unused data in structs).

since you’ve worked intensively with a graph-based library, i’d be interested in your feedback on the API.

on a high level, there are modules that exchange image buffers and have user parameters. on a lower level, these modules can spawn individual nodes: each node corresponds directly to a shader kernel. one example would be my fake filmic: https://jo.dreggn.org/main.comp , which only has one node for the module. but there could be several (demosaic, say: interpolate green, then interpolate red/blue). these nodes are automatically connected following the module connections (see the config snippet above for how these are wired), and the shader code is executed after a topological sort of the graph, resolving dependencies and managing temporary memory on the way.
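
the sort itself is just a depth-first post-order traversal pulling in dependencies from the sink. a sketch in C with made-up structs, not the actual code:

// post-order DFS: a node is appended only after everything it depends on,
// so order[] ends up a valid execution order for the command buffer.
typedef struct gnode_t
{
  struct gnode_t **in; // nodes connected to our inputs
  int num_in, visited;
}
gnode_t;

static void traverse(gnode_t *n, gnode_t **order, int *cnt)
{
  if (n->visited) return;
  n->visited = 1;
  for (int i = 0; i < n->num_in; i++)
    traverse(n->in[i], order, cnt); // resolve dependencies first
  order[(*cnt)++] = n;
}

calling this once on the display node also means anything not reachable from the sink is simply never visited, so dead code is eliminated for free.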

other than the textual config files and the shader kernel, each module can have a small set of other special callbacks, for instance for the ROI negotiations, for the input module to load the raw, and for the export to write the output to disk. i think i want to massage the code a bit more before releasing it to the general public… i could prepare a private preview though.

one more caveat: i developed this on debian sid linux, and i’m quite sure it’ll not run on anything else as-is. i know the vulkan code runs on windows with minor modifications. however, i’m sure it’ll be a major pain to try and port it to macintosh computers (no native vulkan, only via moltenvk, and i know nothing about it).

In my code I am taking a different approach: each tool has various code paths depending on the input data format (grayscale/RGB/Lab/CMYK). The path to be followed is determined when the pipeline is built. If the input format is not supported, no processing is applied and the input is returned as-is.

In my experience, RoI processing has two major drawbacks, due to the fact that the RoI grows as you walk down the pipeline and have filters that require some padding (like blurs):

  • over-computation: if the output RoIs are computed independently, the padding pixels need to be re-computed several times
  • memory requirements: if several filters requiring large padding are chained together, the required tile borders add up linearly, and at some point the required RoI at some intermediate stage might become much larger than the one that finally needs to be computed (for example, three chained blurs that each need 50 px of padding already require 150 px of extra border at the input of the first one)

In both cases, the solution that I adopted was to insert intermediate tile caches, designed to minimise the need for thread synchronisation and thus limit their impact on processing performance.
The tile caches are automatically inserted by the code that builds the processing pipeline, whenever the required padding exceeds a certain threshold.

I develop on macOS and I need a macOS version, so I’ll certainly be forced to look into solutions for porting the code…

aha… i’ll have to think about the different-format thing with automatic translation. shouldn’t be hard to put something like this into place. on the other hand, some modules make little sense in the wrong format (running only on luminance when they expect colour, say). in this case i’d prefer an error message, i think.

re: ROI: i want roi mainly for processing only one viewport, e.g. when zoomed in a lot. what you are talking about is what we call “tiling” in darktable, i.e. processing a lot of independent rois one after the other to assemble the full image in the end. and yes, this is wasteful and only a last-resort thing that we use if we absolutely cannot fit the necessary buffers into memory. i didn’t implement this so far; the idea would be to trigger this code path only if the memory allocator says the pipeline can’t be run on the device in full. tile caches sound like a good option, but i’m unsure about the frequency of the use case… i’m still hoping this would only be triggered infrequently, during export of very large images with complicated operations stacked onto them. if that is true i’ll defer optimisation of this code path until later.

yeah, i remember you were a macintosh person :slight_smile: i have no knowledge about this platform, but it seems the setup via moltenvk (GitHub: KhronosGroup/MoltenVK, a Vulkan portability implementation layered over apple’s Metal) is a straightforward library install plus a shader translation step that generates MSL from SPIR-V.

just a quick status update here. for sport, i’m trying to replicate something like @Carmelo_DrRaw’s fill-in-flash via a guided filter. for that purpose, i need to run multiple nodes per module in the graph. the starting point is this file:

module:rawinput:01
module:demosaic:01
module:exposure:01
module:filmcurv:01
module:display:01
module:contrast:01
connect:rawinput:01:output:demosaic:01:input
connect:demosaic:01:output:exposure:01:input
connect:exposure:01:output:contrast:01:input
connect:contrast:01:output:display:01:input
param:exposure:01:exposure:0.0
param:filmcurv:01:y2:0.8

which is parsed and creates the following module graph:

(image: the module graph)

the graph is then sorted topologically by pulling in all dependencies of the display node (which is a sink). every module knows its parameters and automatically creates a couple of low-level nodes accordingly:

note the dead code elimination: the film curve isn’t connected to the local contrast module yet. i’m implementing the mean() filters in the guided filter as an a-trous-wavelet-style gaussian blur (separable, horizontal and vertical passes), hence the many iterations of blur kernels in between. for those who know the guided filter: this one specifically uses the input image as the guide image (I=p) and performs two-channel blurs to smooth (I, I·I) and (a, b) simultaneously.
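
for reference, these are the standard guided filter equations (he et al.) specialised to I = p, which is why blurring those two pairs is all it takes:

$$\bar I_k = \mathrm{mean}_k(I), \qquad \sigma_k^2 = \mathrm{mean}_k(I \cdot I) - \bar I_k^2$$

$$a_k = \frac{\sigma_k^2}{\sigma_k^2 + \epsilon}, \qquad b_k = (1 - a_k)\,\bar I_k, \qquad q_i = \bar a_i\, I_i + \bar b_i$$

so the blur of (I, I·I) delivers mean and variance in one pass, and the blur of (a, b) then gives the smoothed coefficients for the output q.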

the graphs are debug output, rendered with graphviz.

every one of these nodes has a compute shader that is bound to a pipeline in vulkan and then executed on the GPU. at this point i’m still always processing the full buffer, no ROI.

this is the first non-linear processing pipeline i’m testing and i needed to iron out a few things in the memory allocation/reference counting/graph traversal on the way. will post pictures once i’m more confident that they are actually what i think they are.


another quick update. i might be nearing a state where releasing some initial code wouldn’t be super embarrassing any more. and i have a few first performance numbers, to be taken with a grain of salt.

i implemented parts of google’s demosaicing paper (Handheld Multi-Frame Super-Resolution), namely the gaussian splatting for a single image (no warping yet), both for bayer and x-trans. i might start a separate topic on that; the quality of this seems to be not quite on par with our previous methods yet (i may be doing it wrong).

a full pipeline with demosaic, exposure, and filmic + guided filter for local contrast on the full-res 4832 x 3204 image runs very fluidly on my nvidia GTX 1080 and starts to show lag on my intel HD 5500. screenshot as proof:


(thanks to yteaot for the image)

performance measurement seems to be tricky. i can get timestamps out of the command buffer, but i am not at all sure they mean much. for instance, the intel GPU seems to flush the pipeline when switching between compute and graphics, so there is much more delay than the numbers suggest. the total duration on the intel is measured as something like 8ms, but the displayed lag feels more like a second. i suppose a fragment shader would be much better for intel (but i really want the command line interface without xorg), or there may be some additional trickery to hide the latency of the pipeline flush.
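
(for context, the timestamps come out of vulkan query pools, roughly like this; a sketch only, assuming cmd, device, query_pool, wd/ht and timestamp_period are set up elsewhere:)

// around a dispatch, in the command buffer:
vkCmdResetQueryPool(cmd, query_pool, 0, 2);
vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, query_pool, 0);
vkCmdDispatch(cmd, (wd + 7) / 8, (ht + 7) / 8, 1);
vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, query_pool, 1);

// after waiting for the fence: fetch ticks, scale by
// VkPhysicalDeviceLimits::timestampPeriod (nanoseconds per tick)
uint64_t ts[2];
vkGetQueryPoolResults(device, query_pool, 0, 2, sizeof(ts), ts,
    sizeof(uint64_t), VK_QUERY_RESULT_64_BIT | VK_QUERY_RESULT_WAIT_BIT);
const double ms = (double)(ts[1] - ts[0]) * timestamp_period * 1e-6;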

nvidia shows individual pipeline stages well below a millisecond (like 0.0051 ms), but the overall frame time including sdl input handling, gui drawing etc., when i wait for pipeline completion, is 20ms (a number that is very much in line with how the gui feels).

again, this is running on the full image, not a downscaled/cropped version, and i have not spent any time trying to optimise for speed yet. it feels like there should be quite a bit to be gained here: when playing with the demosaicing (the slowest filter here), i would easily lose or gain a factor of two by doing something that appears pretty much equivalent when looking at the code.


Would using C++'s shared_ptr or unique_ptr help you with the first two?

no and no. 1) i’m not a fan of the hidden semantics of c++ and hope the language goes away soon. 2) it’s not as simple as allocating and freeing your memory as in usual code. i need to run a pass over the graph on the CPU, pretending to allocate/free memory and remembering the offsets, so i can safely access the addresses on the GPU later on (many times, but without re-running the allocator). the issue was that the modules would allocate their inputs and free their outputs in some order depending on graph traversal. this is not necessarily the conservative order (i.e. there might be one alloc, one free, and only after that another four refs from dependencies further down the graph). but this part is done and working now (i went the safe way and added another pass over the graph that just counts references, without pretending to allocate yet).
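
in made-up C (not the actual code), the two passes look roughly like this:

#include <stddef.h>

typedef struct buf_t { int refs; size_t size, offset; } buf_t;
typedef struct anode_t { buf_t *in[4]; int num_in; buf_t out; } anode_t;

size_t alloc_block(size_t size);   // hypothetical offset allocator
void   free_block (size_t offset); // marks a block as reusable

void plan_memory(anode_t **node, int cnt) // node[] in execution order
{
  // pass 1: only count the future consumers of every buffer
  for (int n = 0; n < cnt; n++)
    for (int i = 0; i < node[n]->num_in; i++)
      node[n]->in[i]->refs++;
  // pass 2: assign offsets. a buffer is recycled exactly when its last
  // consumer has run, independent of the order the graph traversal
  // happened to discover the nodes in.
  for (int n = 0; n < cnt; n++)
  {
    node[n]->out.offset = alloc_block(node[n]->out.size);
    for (int i = 0; i < node[n]->num_in; i++)
      if (--node[n]->in[i]->refs == 0)
        free_block(node[n]->in[i]->offset);
  }
}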

Looks interesting!

What’s your sloc? :stuck_out_tongue: Presumably all statically compiled :wink:

7996 total (including a bit of unit tests), out of which 1k is vulkan init and 1k is graph io/traversal. the rest is evenly distributed over smaller things.

there’s an additional 1395 lines of glsl shader code; most of the shaders are like 40 loc (demosaicing is a bit more, but i don’t think i’ll be needing half of that experimental code).

dependencies are rawspeed (+pugixml), libjpeg, SDL2, and vulkan. i statically pull in a pthread pool and imgui, but so far the build system works well with only a Makefile and no bloat.

a fresh compile after make clean currently takes ~6s, including the pthread pool, imgui, and glsl → spirv compilation, but not a rawspeed rebuild.

isn’t it fun starting from scratch while things are still lean…


That was rawproc, 3 years ago. I learned a lot adding the necessary things for raw processing to it, to support my particular notions of workflow.

Now it has become a bit of a rat’s nest, so with my new-found time (semi-retired!) I’m going to “reboot” and write rawproc2: incorporate all the libraries in one image class, and redo the tool-chain architecture a bit more cleanly.

Keeps me out of the bars… :smile:


hehe, i know… and it was darktable ten years ago. only that back then i tried to run stuff on the GPU and stopped when the max supported image size was like 2048x2048… now this number is >=16kx16k and i’m much more encouraged to continue. i really hope to avoid some of the toolchain bloat this time.

congrats on retirement (even if only semi)! i would still recommend bars though :slight_smile:

I’m really looking forward to the first release! I see a lot of similarities with PhotoFlow’s processing pipeline, and I hope that the work you have started will end up as a full-featured GPU path in PhF!

From my side, I’d be willing to try porting the code to macOS with MoltenVK, although I’ll have to learn a lot of things on the way. What do you think?

i would not call it a release… and i don’t think this is in a state where you’d want to include it (the api needs severe changes, etc.). but i very much appreciate any help creating a faster pipeline for the open source world, and i think sharing early might be a good idea. keen to hear any feedback you may have.


since the detail enhancement (log curve + detail reconstruction) with the guided filter is very much bound to exactly one separation/one frequency, i’m also working to get the (more expensive) local laplacian pyramids into this pipeline, for all-frequency contrast enhancement. for giggles, here’s the node graph it produces so far:

this is auto-generated from this input config:

module:rawinput:01
module:demosaic:01
module:exposure:01
module:display:01
module:llap:01
connect:rawinput:01:output:demosaic:01:input
connect:demosaic:01:output:exposure:01:input
connect:exposure:01:output:llap:01:input
connect:llap:01:output:display:01:input
param:exposure:01:exposure:0.67
param:rawinput:01:filename:XXX.ARW

I would like to make a first attempt at integrating your code into PhotoFlow. Could you give me some hints regarding which part of the code is responsible for building the pipeline? Also, would it be feasible to split the DT-specific code from the processing pipeline itself?

I have to admit that I have not had much time yet to dig into the code you provided; that’s also why some “entry point” would be of great help…

Thanks for providing this!

hi,

i don’t think there is anything dt-specific now (because there’s really nothing there yet). the pipeline code is in src/pipe and the vulkan-specific stuff is in src/qvk (which originates from the quake vulkan code base).

an entry point to how things are done is probably the command line client vkdt-cli in src/cli/. if you’d like to display the VkImage while it is still on the GPU instead of downloading it (and passing it to bloatware gui libraries), you’d need to init a window and surface for vulkan, too. this part of the code would be in src/gui/.

probably, for a first quick test, it would be easiest to provide a default pipeline (see bin/examples/*.cfg), run it through the headless pipeline as the command line interface would, download the final pixels, and send them to the display as usual.

on macintosh i would probably first try to compile the cli as-is and see whether i could get it to run through moltenvk.

@hanatos I have made a first attempt at porting your code to macOS, using vulkan and moltenvk from MacPorts on a macOS 10.14 system, and it seems to work flawlessly!

I had to introduce some minor modifications to the code in order to get it compiled under macOS, and I had to disable the “pin ourselves to a cpu” part in threads.h (macOS apparently does not provide the sched_setaffinity() function).
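
A guard along these lines would keep the linux path intact (a sketch, not the actual threads.h code; on macOS one would presumably need the Mach thread_policy_set() API instead):

#ifdef __linux__
#define _GNU_SOURCE
#include <sched.h>
#endif

// pin the calling thread to one core on linux, no-op elsewhere
static void pin_to_cpu(int cpu)
{
#ifdef __linux__
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  sched_setaffinity(0, sizeof(set), &set); // 0 = calling thread
#else
  (void)cpu; // macOS does not provide sched_setaffinity()
#endif
}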

The gui seems to be working properly, and is rather responsive:

It would be interesting to run some benchmarks and see if performance is as expected… any suggestions?

oh nice, that was quick! i had heard moltenvk is a big dependency, with llvm and all… but it sounds like you haven’t had much trouble getting it to work. that’s great!

yes, the affinity thing is messy. i guess there’s an equivalent on macintosh, but i forgot how it works. this is currently not used anyways; i was planning to use it for parallel thumbnail creation and database queries (none of this is implemented). parallel thumbs should be interesting, as they need some pipeline interleaving on the GPU.

you can run ‘-d perf’ to get some output with respect to timings on stdout… but i’m not sure i trust these values. it also outputs a frames/second counter which is measured around the full loop, but the intel GPU seems to cheat there, too (it shows an old image quickly and only updates it every so often). on my nvidia GPU it’s in the lower single-digit milliseconds for full demosaic + local laplacian etc. at 24MPix.

i think i have a hardcoded #define HALF_SIZE_DEMOSAIC in the demosaic/main.c file at this point. because everything was so stupidly fast, i wasn’t sure how to go ahead with the region of interest processing. going forward, i guess we’ll have slow modules again at some point, and it might still be worthwhile to at least switch resolution.

there’s also the command line interface vkdt-cli, but the run time is dominated by disk io.

btw i was away last week and will be away from keyboards for the next three weeks, too, so i will be unresponsive (but am still interested in pushing for this!).