No idea. I often switch between different architectures, but I haven’t seen a difference. Faster (CPU clock) is usually better (assuming 16 cores minimum, which seems pretty standard nowadays, but I may be wrong).
What David essentially shakes down in G’MIC might become candidates for vkdt shaders.
One has to be careful retro-fitting GPU operations onto CPU-centric software like G’MIC. For any individual operation, a lot of resources must be expended marshalling the image data into and out of the GPU, so the net gain might not be significant in the grand scheme of an otherwise CPU-centric G’MIC toolchain.
Exactly.
One major concept used by GPU-centric ML libraries is that the whole library focuses on building a pipeline that can run entirely on the GPU, thus limiting data I/O with the CPU or the regular RAM of the PC. That’s the only way to get full speed and to benefit from the GPU architecture.
But that also adds a few strong technical constraints you don’t have with a CPU-centric approach:
in particular, you often have more limited memory (unless you have thousands of dollars to spend on high-end GPUs).
The approach is totally different, and that’s why it’s often not feasible (or at least not easy) to just “port” code from CPU to GPU. If you plan to use a GPU somewhere, your code must take this constraint into account from the very beginning (which was clearly not the plan when I started writing G’MIC).
For the user, there’s only one letter that changes between a CPU and a GPU, but from the developer’s point of view, they’re two completely different worlds.
There are a few cases where there might be. I can see operations that require a lot of repeat() iterations per pixel benefiting from a GPU pipeline, like fractals. I also think my bin2dec algorithm (which can work on millions of binary digits) would work better there too. But it’s not happening anyway.
fwiw david and i exchanged network weights in the past and i had a GPU version of his residual net for denoising. it’s just very laborious to adjust the GPU implementation to network architecture changes so i believe my code rotted. i think the most promising route for useful neural stuff in vkdt is train in pytorch, serialise weights and load them into a module that has very limited building block options, such as this demosaicing/denoising convolutional U-net: vkdt/src/pipe/modules/jddcnn at master · hanatos/vkdt · GitHub . and yes, half precision tensor cores help with performance for this kind of compute.
… just in case anyone would like to jump on it and make it compatible again, i’m talking about g’mic’s denoise_cnn and the resnet code in vkdt.
Funny that @hanatos just mentioned it, because these last few days, I’ve been trying to completely recode the denoise_cnn
command, in order to use a CNN that can take an estimated noise level (a scalar value) as an input to the network.
That way, a single network can denoise an image more or less strongly, according to the user’s wishes.
It turned out to be not that easy to make the network take this parameter into account, rather than doing its own noise-level estimation and completely ignoring the user’s instructions (a bit like a child).
That issue was finally solved using FiLM (Feature-wise Linear Modulation), a technique I find a bit “violent”, but apparently that’s what networks need (unlike children!).
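For readers unfamiliar with FiLM: the idea is that a small side network maps the conditioning scalar (here, the noise level) to one scale/shift pair per feature channel, which is then applied to the feature maps. A minimal numpy sketch of that mechanism (the shapes, hidden size and random weights here are illustrative, not the actual G’MIC network):

```python
import numpy as np

def leaky_relu(x, a=0.1):
    # leaky ReLU activation, as used throughout the network
    return np.where(x > 0, x, a * x)

def film(features, noise_level, params):
    """Feature-wise Linear Modulation: a tiny MLP maps the scalar
    noise level to one (gamma, beta) pair per channel, then each
    feature map is scaled and shifted channel-wise."""
    W1, b1, W2, b2 = params
    h = leaky_relu(W1 @ np.array([noise_level]) + b1)  # hidden layer
    gb = W2 @ h + b2                                   # (2*C,) outputs
    C = features.shape[-1]
    gamma, beta = gb[:C], gb[C:]
    return gamma * features + beta                     # broadcast over H, W

# Illustrative sizes: 64 channels, 64 hidden units, random weights.
rng = np.random.default_rng(0)
C, hidden = 64, 64
params = (rng.normal(size=(hidden, 1)), rng.normal(size=hidden),
          rng.normal(size=(2 * C, hidden)), rng.normal(size=2 * C))
feats = rng.normal(size=(8, 8, C))
out = film(feats, 20.0, params)
```

Because the modulation multiplies and shifts every channel directly, the conditioning signal cannot be ignored by the rest of the network, which is presumably why it works where a simple extra input channel does not.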
The new denoising network now has about 450k parameters (larger than before).
Some examples:
Noisy image:
Denoised images:
gmic noisy.png +denoise_cnn[0] 5,32 +denoise_cnn[0] 10,32 +denoise_cnn[0] 20,32 +denoise_cnn[0] 40,32
I’ll try to release version 3.5.1 of G’MIC soon, with this network updated.
awesome. i handed the noise variance/parameter as an extra channel per pixel in the past and it would sometimes be picked up.
your results look really useful there.
That is what I tried first, yes, and while this indeed works for smaller networks, it does not for the new network (which has a total of 12 convolutional layers).
I’ve tried various things (like appending the noise channel to the output of each intermediate layer, or encoding the noise level as multiple channels of sines with various frequencies).
But FiLM is the only thing that made a real difference.
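The sine-based encoding mentioned above can be sketched as follows: the scalar noise level is expanded into several channels of sines and cosines at geometrically spaced frequencies, similar in spirit to a positional encoding (the frequency schedule here is an assumption for illustration):

```python
import numpy as np

def sine_encode(noise_level, n_freqs=4):
    """Encode a scalar as 2*n_freqs channels of sines/cosines at
    powers-of-two frequencies, so the network sees the value at
    several scales instead of as a single raw number."""
    freqs = 2.0 ** np.arange(n_freqs)  # 1, 2, 4, 8, ...
    return np.concatenate([np.sin(freqs * noise_level),
                           np.cos(freqs * noise_level)])

enc = sine_encode(20.0)
```

Each encoded value would then be replicated spatially and concatenated to the feature maps; as noted above, this was not sufficient for the deeper network.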
awesome. do you perchance have an automated visualisation of your architecture? maybe even just in form of a table? or as a graphviz/dot graph? are these straight 12 layers 3x3 with like 64 channels? or U-net/downsampling too?
Yes, almost: they are just grouped in pairs in residual blocks. The network also includes a small MLP dedicated to the FiLM encoding (yes, it requires its own MLP to be trained alongside).
FYI, here is the structure, as stored in G’MIC (less readable than a graph, sorry, but all the info is in there).
$ gmic nn_load https://gmic.eu/gmic_denoise_cnn.gmz nn_print q
[gmic]./ Start G'MIC interpreter (v.3.5.1).
[gmic]./ [nn_lib] Load network from file 'https://gmic.eu/gmic_denoise_cnn.gmz' (include training data).
[gmic]./ [nn_lib] Print info on neural network 'denoise'.
* Network name: denoise
- Layer: NL (type: input)
* Output size: 1,1,1,1
- Layer: FI1_fc (type: fc)
* Input: NL (1,1,1,1)
* Output size: 1,1,1,64
* Properties: learning_mode=3, regularization=0
- Layer: FI1 (type: nl)
* Input: FI1_fc (1,1,1,64)
* Output size: 1,1,1,64
* Property: activation=leakyrelu
- Layer: FI2_fc (type: fc)
* Input: FI1 (1,1,1,64)
* Output size: 1,1,1,64
* Properties: learning_mode=3, regularization=0
- Layer: FI2 (type: nl)
* Input: FI2_fc (1,1,1,64)
* Output size: 1,1,1,64
* Property: activation=leakyrelu
- Layer: FI3_fc (type: fc)
* Input: FI2 (1,1,1,64)
* Output size: 1,1,1,2
* Properties: learning_mode=3, regularization=0
- Layer: FI3 (type: nl)
* Input: FI3_fc (1,1,1,2)
* Output size: 1,1,1,2
* Property: activation=leakyrelu
- Layer: ALPHA (type: split)
* Input: FI3 (1,1,1,2)
* Output size: 1,1,1,1
* Property: axis=c
- Layer: BETA (type: split)
* Input: FI3 (1,1,1,2)
* Output size: 1,1,1,1
* Property: axis=c
- Layer: R_ALPHA (type: resize)
* Input: ALPHA (1,1,1,1)
* Output size: 64,64,1,64
* Property: interpolation=3
- Layer: R_BETA (type: resize)
* Input: BETA (1,1,1,1)
* Output size: 64,64,1,64
* Property: interpolation=3
- Layer: IN (type: input)
* Output size: 64,64,1,3
- Layer: FM (type: conv2d)
* Input: IN (64,64,1,3)
* Output size: 64,64,1,64
* Properties: kernel=5x5, stride=1, dilation=1, shrink=0, boundary_conditions=neumann, learning_mode=3, regularization=0
- Layer: NFM (type: normalize)
* Input: FM (64,64,1,64)
* Output size: 64,64,1,64
* Properties: normalization=global, learning_mode=3
- Layer: FM2 (type: mul)
* Inputs: NFM,R_ALPHA (64,64,1,64 and 64,64,1,64)
* Output size: 64,64,1,64
- Layer: FM3 (type: add)
* Inputs: FM2,R_BETA (64,64,1,64 and 64,64,1,64)
* Output size: 64,64,1,64
- Layer: C0_conv2d0 (type: conv2d)
* Input: FM3 (64,64,1,64)
* Output size: 64,64,1,64
* Properties: kernel=3x3, stride=1, dilation=1, shrink=0, boundary_conditions=neumann, learning_mode=3, regularization=0
- Layer: C0_nl0 (type: nl)
* Input: C0_conv2d0 (64,64,1,64)
* Output size: 64,64,1,64
* Property: activation=leakyrelu
- Layer: C0_conv2d1 (type: conv2d)
* Input: C0_nl0 (64,64,1,64)
* Output size: 64,64,1,64
* Properties: kernel=3x3, stride=1, dilation=1, shrink=0, boundary_conditions=neumann, learning_mode=3, regularization=0
- Layer: C0_add (type: add)
* Inputs: C0_conv2d1,FM3 (64,64,1,64 and 64,64,1,64)
* Output size: 64,64,1,64
- Layer: C0 (type: nl)
* Input: C0_add (64,64,1,64)
* Output size: 64,64,1,64
* Property: activation=leakyrelu
- Layer: C1_conv2d0 (type: conv2d)
* Input: C0 (64,64,1,64)
* Output size: 64,64,1,64
* Properties: kernel=3x3, stride=1, dilation=1, shrink=0, boundary_conditions=neumann, learning_mode=3, regularization=0
- Layer: C1_nl0 (type: nl)
* Input: C1_conv2d0 (64,64,1,64)
* Output size: 64,64,1,64
* Property: activation=leakyrelu
- Layer: C1_conv2d1 (type: conv2d)
* Input: C1_nl0 (64,64,1,64)
* Output size: 64,64,1,64
* Properties: kernel=3x3, stride=1, dilation=1, shrink=0, boundary_conditions=neumann, learning_mode=3, regularization=0
- Layer: C1_add (type: add)
* Inputs: C1_conv2d1,C0 (64,64,1,64 and 64,64,1,64)
* Output size: 64,64,1,64
- Layer: C1 (type: nl)
* Input: C1_add (64,64,1,64)
* Output size: 64,64,1,64
* Property: activation=leakyrelu
- Layer: C2_conv2d0 (type: conv2d)
* Input: C1 (64,64,1,64)
* Output size: 64,64,1,64
* Properties: kernel=3x3, stride=1, dilation=1, shrink=0, boundary_conditions=neumann, learning_mode=3, regularization=0
- Layer: C2_nl0 (type: nl)
* Input: C2_conv2d0 (64,64,1,64)
* Output size: 64,64,1,64
* Property: activation=leakyrelu
- Layer: C2_conv2d1 (type: conv2d)
* Input: C2_nl0 (64,64,1,64)
* Output size: 64,64,1,64
* Properties: kernel=3x3, stride=1, dilation=1, shrink=0, boundary_conditions=neumann, learning_mode=3, regularization=0
- Layer: C2_add (type: add)
* Inputs: C2_conv2d1,C1 (64,64,1,64 and 64,64,1,64)
* Output size: 64,64,1,64
- Layer: C2 (type: nl)
* Input: C2_add (64,64,1,64)
* Output size: 64,64,1,64
* Property: activation=leakyrelu
- Layer: C3_conv2d0 (type: conv2d)
* Input: C2 (64,64,1,64)
* Output size: 64,64,1,64
* Properties: kernel=3x3, stride=1, dilation=1, shrink=0, boundary_conditions=neumann, learning_mode=3, regularization=0
- Layer: C3_nl0 (type: nl)
* Input: C3_conv2d0 (64,64,1,64)
* Output size: 64,64,1,64
* Property: activation=leakyrelu
- Layer: C3_conv2d1 (type: conv2d)
* Input: C3_nl0 (64,64,1,64)
* Output size: 64,64,1,64
* Properties: kernel=3x3, stride=1, dilation=1, shrink=0, boundary_conditions=neumann, learning_mode=3, regularization=0
- Layer: C3_add (type: add)
* Inputs: C3_conv2d1,C2 (64,64,1,64 and 64,64,1,64)
* Output size: 64,64,1,64
- Layer: C3 (type: nl)
* Input: C3_add (64,64,1,64)
* Output size: 64,64,1,64
* Property: activation=leakyrelu
- Layer: C4_conv2d0 (type: conv2d)
* Input: C3 (64,64,1,64)
* Output size: 64,64,1,64
* Properties: kernel=3x3, stride=1, dilation=1, shrink=0, boundary_conditions=neumann, learning_mode=3, regularization=0
- Layer: C4_nl0 (type: nl)
* Input: C4_conv2d0 (64,64,1,64)
* Output size: 64,64,1,64
* Property: activation=leakyrelu
- Layer: C4_conv2d1 (type: conv2d)
* Input: C4_nl0 (64,64,1,64)
* Output size: 64,64,1,64
* Properties: kernel=3x3, stride=1, dilation=1, shrink=0, boundary_conditions=neumann, learning_mode=3, regularization=0
- Layer: C4_add (type: add)
* Inputs: C4_conv2d1,C3 (64,64,1,64 and 64,64,1,64)
* Output size: 64,64,1,64
- Layer: C4 (type: nl)
* Input: C4_add (64,64,1,64)
* Output size: 64,64,1,64
* Property: activation=leakyrelu
- Layer: C5_conv2d0 (type: conv2d)
* Input: C4 (64,64,1,64)
* Output size: 64,64,1,64
* Properties: kernel=3x3, stride=1, dilation=1, shrink=0, boundary_conditions=neumann, learning_mode=3, regularization=0
- Layer: C5_nl0 (type: nl)
* Input: C5_conv2d0 (64,64,1,64)
* Output size: 64,64,1,64
* Property: activation=leakyrelu
- Layer: C5_conv2d1 (type: conv2d)
* Input: C5_nl0 (64,64,1,64)
* Output size: 64,64,1,64
* Properties: kernel=3x3, stride=1, dilation=1, shrink=0, boundary_conditions=neumann, learning_mode=3, regularization=0
- Layer: C5_add (type: add)
* Inputs: C5_conv2d1,C4 (64,64,1,64 and 64,64,1,64)
* Output size: 64,64,1,64
- Layer: C5 (type: nl)
* Input: C5_add (64,64,1,64)
* Output size: 64,64,1,64
* Property: activation=leakyrelu
- Layer: NOISE (type: conv2d)
* Input: C5 (64,64,1,64)
* Output size: 64,64,1,3
* Properties: kernel=3x3, stride=1, dilation=1, shrink=0, boundary_conditions=neumann, learning_mode=3, regularization=0
- Layer: OUT (type: sub)
* Inputs: IN,NOISE (64,64,1,3 and 64,64,1,3)
* Output size: 64,64,1,3
* Total: 48 layers, 454153 parameters.
[gmic]./ Quit G'MIC interpreter.
(NL stands for Noise Level, and is a scalar value).
EDIT: Finally, there are 14 convolutional layers in total.
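The repeating Cn_conv2d0 / Cn_nl0 / Cn_conv2d1 / Cn_add / Cn pattern in the listing above is a standard residual block: conv, leaky ReLU, conv, skip-add, leaky ReLU. A single-channel numpy sketch of one such block (the real layers use 64 channels; this simplification is just to show the data flow, with the Neumann boundary handling from the listing):

```python
import numpy as np

def conv2d_same(img, kernel):
    """3x3 convolution with edge-replicate padding, matching the
    boundary_conditions=neumann property in the listing."""
    padded = np.pad(img, 1, mode="edge")
    H, W = img.shape
    out = np.zeros_like(img, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + H, dx:dx + W]
    return out

def leaky_relu(x, a=0.1):
    return np.where(x > 0, x, a * x)

def residual_block(x, k0, k1):
    """conv -> leaky ReLU -> conv -> skip add -> leaky ReLU,
    i.e. the Cn_conv2d0 / Cn_nl0 / Cn_conv2d1 / Cn_add / Cn pattern."""
    y = leaky_relu(conv2d_same(x, k0))
    y = conv2d_same(y, k1)
    return leaky_relu(y + x)

# Demo with identity kernels: the block then just doubles a positive input.
ident = np.zeros((3, 3)); ident[1, 1] = 1.0
x = np.ones((4, 4))
y = residual_block(x, ident, ident)
```

The final OUT = IN - NOISE subtraction in the listing means the stack of blocks predicts the noise residual rather than the clean image, which is the usual residual-learning setup for denoising.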
nice. now for a gpu port i’d need to reimplement some of these special layers, and then probably re-train just a little to account for doubles vs. f16 in the computation.
also i’d probably make it a U-net architecture to make speed more realistic. 14 full buffer convolutions with that number of channels is probably above my pain threshold for interactive adjustments.
now i’m dreaming of an automatic / scripted export from gmic network architecture to pytorch code (for 16-bit re-training) to vkdt inference code…
That looks like a painful task. I’m not sure it’s worth the effort!
And yes, I confirm, the new network is a lot slower than the previous one.
I may try to retrain a lighter one in the future.
Oooooohhh, collaborative communication between the King of ToolChain (@David_Tschumperle ) and the King of GPU (@hanatos ). Books will be written about this…
chrr thanks but i doubt it
Haha. In general, I’m better known as “the king of pain in the ass”, which makes a change