Difference between two regression techniques

Hello, as I wrote some time ago, I wanted to decorrelate an IR scan and a visible light scan (red channel) of the same slide. This because the red channel leaks in the IR and I want to better isolate the dust and scratches before inpainting them.

The idea is to do a linear regression between red channel and IR channel, then to subtract the calculated red-contribution from the IR.

I used the simple formula from Wikipedia.
More specifically, I wanted to replicate this:

I tried first to replicate the final formula:
b= (avg(xy) - avg(x)*avg(y)) / (avg(x^2) - (avg(x))^2)
so I wrote in G’Mic ([-1] is IR, [-2] is red channel):

regression:
# y = IR
# x = original channel (red, luminance)
# y = a + bx
# b= (avg(xy) - avg(x)*avg(y)) / (avg(x^2) - (avg(x))^2)
# a= _y - a * _x
# xy
--mul[-2] [-1]
xy={-1,ia}
-rm[-1]
x={-2,ia}
y={-1,ia}

--sqr[-2]
xx={-1,ia}
-rm[-1]
# (avg(x))^2
x2={{-2,ia}*{-2,ia}}

_b={($xy-($x*$y))/($xx-$x2)}
_a={$y-$_b*$x}

then I thought about the first formula, more specifically the one with summatories:
b= (sum( (x_i-avg(x))*(y_i-avg(y)) )) / (sum( (x_i-x^2)^2) )
and I wrote this code ([-1] is IR, [-2] is red channel):

regression2:
# y = IR
# x = original channel (red, luminance)
# y = a + bx
# b=(sum( (x_i-avg(x))*(y_i-avg(y)) )) / (sum( (x_i-x^2)^2) )
# a=avg(y)-b*avg(x)
--sub[-2] {-2,ia}
--sub[-2] {-2,ia}
--mul[-2] [-1]
b1={-1,is}
-rm[-3,-2,-1]

--sqr[-2]
--sub[-3] {-1,ia}
b2={-1,is}
-rm[-2,-1]

_b={$b1/$b2}
_a={{-1,ia}-$_b*{-2,ia}}

The results are different and I’m not sure why. The first implementation is the one which gives me the best (basically perfect) results.

Could someone help me? I can continue anyway working but I’d like to understand the mistake.

Thanks

Hello.
Interesting problem :slightly_smiling:

I don’t have any certainties about the cause, but to me, the second formula looks a bit more ‘unstable’ to compute from a purely numerical point of view. I suppose that if your numerator and denominator are two big numbers (and as you have probably a lot of pixels, they are), you will loose precision in the final result, compared to the fraction done using already averaged numbers. I may be completely wrong though, but I already had some comparable issues in the past with the computation of fractions.

Also, just in case you could be interested by this. I use to compute my linear regressions using the command -solve that computes the least square solution to a linear system.
The idea is to build the system [X,1].[a;b] = [Y], where X and Y are the column vectors composed of the values of images X and Y you want to match, and [a;b] the vector of the coefficients you are looking for.
This small piece of code compute a and b for two input images [0] = X and [1] = Y such that ||Y - (a.X + b)||^2 is minimum:

linear_reg : 
 --l -y 1,100%,1,1,1 -a[0,-1] x -r 100%,{min(h,10000)} -solve. [-2] -k. a={[0]} b={[1]} -rm -endl  # Get vector [a,b]
 --*[0] $a -+. $b  # Compute [2] = aX+b (should match [1]).

I hope this helps.

Thanks for the info! What you write makes sense, I’ll discard the second script I wrote.
Now I’m trying to understand yours. I think I got it but I still have some questions.

First of all, I wasn’t able to find any passage in the reference docs where you explain the meaning of the dot after a command. I understand it means [-1]. Is it there?

linear_reg : 
 --l -y 1,100%,1,1,1 -a[0,-1] x -r 100%,{min(h,10000)} -solve. [-2] -k. a={[0]} b={[1]} -rm -endl  # Get vector [a,b]
 --*[0] $a -+. $b  # Compute [2] = aX+b (should match [1]).

means

--local -unroll 1,100%,1,1,1 -append[0,-1] x -resize 100%,{min(h,10000)}
     -solve. [-2] -keep. a={[0]} b={[1]} -rm -endlocal  # Get vector [a,b]
     --mul[0] $a -add. $b  # Compute [2] = aX+b (should match [1]).

--local
takes I guess all the images (that should be only two, [0,1]) and puts them in a separate “list”, but in the docs -local appears with single dash, here double. What is its meaning in this context? especially since at the end you do -rm so the local list gets emptied, nothing would have to be added back to the outer one.

1,100%,1,1,1
creates a new image with same y size as the previous one, filled with 1s for the b.

-append[0,-1] x
from the first and the just generated image that are 1 by (number of pixels) I get only one. I guess the first one (that means [0]) gets modified in-place.

-resize 100%,{min(h,10000)}
here you want at most 10k pixels along y axis for speed reason, and it applies to all the three images. In my case the images I’m working on are about 32 Mpx… good idea. Why default nearest interpolation and not average?

-solve. [-2]
I guess this means [-1] is the only image which got the result of the “append” command, otherwise [-2] would not have been a simple 1D vector. Result is an image 2x1 using least squares because over-determined.

a={[0]} b={[1]}
I know that {…} means that you request a feature implicitly of the image [-1] (section substitution rules) but I cannot find the explanation of the use of [0] or [1] as feature (I guess it gives the first and second values of the content of the image)

Yes it means [-1]. From the reference page :

Command selections '[-1]','[-2]' and '[-3]' are so often used that they have their own shortcuts, 
respectively '.', '..' and '...'. For instance, command '-blur..' is equivalent to '-blur[-2]'. 
These shortcuts work only for command selections, not for command arguments. 

A double dash means “works on a copy instead of in-place”. Here, we copy the two input images into a separate local environment (so inside, you have a list with the two copies). I do that because I modify the images data inside the local environment. Just before exiting the local environment, I don’t need the modified image data anymore, so I remove them all. That’s because I’ve found my coefficients a and b and stored their values in variables.
Here the --local just ensures you won’t modify your two input images at all.

What I do in the --local environment is I construct a matrix A (image [0]) and a vector B (image [1]), such that I’m looking to get the unknown X (the vector of the two coefs a and b of the linear transformation), so that :

A.X = B  (matrix product).

This is an over-determined system in this case, but command -solve is able to return the least-square solution X to this system. Here, A is a matrix with 2 columns : the first column is the set of values of the first image, and the second column is filled with 1. The vector B is the set of values of the second image.

Keeping only 10k values is not only for speed reasons, it’s also because if you keep all the values you get a system that is so-overdetermined that you lose numerical precisions when computing the least-square solution with -solve (this is due to the unstability of the pseudo-inverse computation).

Thanks for the answer, I think I got all of it. I also tried myself to change resize algorithm and there is little change in the final calculated variables.

However your algorithm doesn’t seem to produce correct results, while mine does.

I could use mine and be happy but I would like to understand what is going wrong.
I uploaded the source images:
red channel
IR channel

and you can try yourself if you want, just copy and paste:
mine:

-i Dia_1_red.tif -i Dia_1_IR.tif
--mul[-2] [-1] xy={-1,ia} -rm. x={-2,ia} y={-1,ia}
--sqr.. xx={-1,ia} -rm. x2={{-2,ia}*{-2,ia}}
_a={($xy-($x*$y))/($xx-$x2)} _b={$y-$_a*$x}
--*[0] $_a -+. $_b --display_histogram

yours:

-i Dia_1_red.tif -i Dia_1_IR.tif
--l -y 1,100%,1,1,1 -a[0,-1] x -r 100%,{min(h,10000)}
-solve. [-2] -k. a={[0]} b={[1]} -rm -endl
--*[0] $_a -+. $_b --display_histogram

I paste the summary of the input images (I cut some lines):

[0] = 'Dia_1_red.tif':
  min = 3619, max = 42386, mean = 14729.3, std = 5100.47, coords_min = (1838,1763,0,0), coords_max = (3342,1866,0,0).
[1] = 'Dia_1_IR.tif':
  min = 8565, max = 58354, mean = 48172.9, std = 3108.76, coords_min = (2911,984,0,0), coords_max = (3348,1968,0,0).

My result:

[gmic]-2./regression/ Set global variable _a='0.5623662604504396'.
[gmic]-2./regression/ Set global variable _b='39889.67002724859'.

[2] = 'Dia_1_red~.tif':
  min = 41924.9, max = 63726.1, mean = 48172.9, std = 2868.33, coords_min = (1838,1763,0,0), coords_max = (3342,1866,0,0).

And it makes sense, the IR histogram is compressed and shifted to the right.

Your algorithm:

[gmic]-1./*local/ Set local variable a='2.981390953063965'.
[gmic]-1./*local/ Set local variable b='0.0001807634544093162'.

[2] = 'Dia_1_red~.tif':
  min = 10789.7, max = 126369, mean = 43913.7, std = 15206.5, coords_min = (1838,1763,0,0), coords_max = (3342,1866,0,0).

The average is not correct and the std is larger, because somehow instead of compressing and shifting, your result expands and shifts only a little.

I can see with
--sub[-1] [-2] --display_histogram
that the the resulting difference is quite small:

[3] = 'Dia_1_red~.tif':
  min = -4566.15, max = 35204.4, mean = 0.00165331, std = 1198.79, coords_min = (3362,2617,0,0), coords_max = (2911,984,0,0).

With yours:

[3] = 'Dia_1_red~.tif':
  min = -29142, max = 70950.3, mean = -4259.21, std = 12396.3, coords_min = (4018,709,0,0), coords_max = (3338,1870,0,0).

Maybe this depends on the algorithms themselves and not on the implementation?
What do you think?

I don’t know what is going on with your particular case, but the piece of code I’ve proposed works for me, at least with a test image:

$ gmic -testimage2d  512 --* 2 -+. -70 --l -y 1,100%,1,1,1 -a[0,-1] x -r "100%,{min(h,10000)}" -solve. [-2] -k. a={[0]} b={[1]} -rm -endl

returns approximatively a=2 and b=70, which is correct.

Interesting test.
With mine I also get a=2 and b=-70, then the input images must affect the two algorithms differently.