Fun with Summed Area Tables

David_Tschumperle · February 19, 2021, 1:35pm

Yes, clearly this should be possible.
As long as you don’t need to ‘leave’ the math parser, you can stick with double-valued computations.

snibgo · February 19, 2021, 2:27pm

The major feature of Integral Images (“II”, aka Summed Area Table, SAT) is that run-time is independent of window size. So finding the windowed mean (aka blur) with windows sized 3x3 or 301x901 takes the same time. In addition, if we need a number of different windowed means for an image, we create the SAT once then use it (quickly) for each different mean.

For large windows the performance advantage over conventional methods is huge, several orders of magnitude, far outweighing the cost of extra precision.

We don’t often need to save the SAT in a file. For example, my implementation of Guided Filters doesn’t save the SAT.

As I show at Integral images, we can use II for a variety of applications, such as edge detection and sharpening. It isn’t the answer to life, the universe and everything, but it is a very useful and fast tool.

afre · February 19, 2021, 8:03pm

That is why I use convolve. I agree with your other points. I haven’t but should debug.

Reasons for my interest in SAT. It is also an elegant solution.

garagecoder · February 19, 2021, 8:45pm

Initial working 64bit:

gcd_boxfilter_local64 : check ${"is_image_arg $1"}
  pass$1 0 max. 1 round. 1
  repeat $!-1 l[$>,-1]
    f.. "
    begin(
      const boundary=1;
      ref(crop(#0),img);
      px(x,y) = (img[x + y*w#0 + c*wh#0]);

      const W=w-1; const H=h-1;

      T(X,Y)=Y>1?px(X,Y-1):Y*px(X,0);
      B(X,Y)=Y<H?px(X,Y):(1+Y-H)*px(X,H)-(Y-H)*px(X,H-1);
      G(X,Y)=X<W?B(X,Y):(1+X-W)*B(W,Y)-(X-W)*B(W-1,Y);
      D(X,Y)=X>1?T(X-1,Y):X*T(0,Y);
      E(X,Y)=X<W?T(X,Y):(1+X-W)*T(W,Y)-(X-W)*T(W-1,Y);
      F(X,Y)=X>1?B(X-1,Y):X*B(0,Y);

      for (C = 0; P = 0, C<s#0, ++C,
        for (X = 0; S = 0, X<w#0, ++X; ++P,
          S += img[P];
          img[P] = S;
        );
        for (Y = 1, Y<h#0, ++Y,
          for (X = 0; S = 0, X<w#0, ++X; ++P,
            S += img[P];
            img[P] = img[P-w#0]+S;
          )
        )
      )
    );
    K=i#1; R=K*2+1; S=R*R;
    x0 = x - K;
    y0 = y - K;
    x1 = x + K;
    y1 = y + K;
    (G(x1,y1) + D(x0,y0) - E(x1,y0) - F(x0,y1))/S"
  endl done rm.

Window sizes are taken as round(k) * 2 + 1 (there’s no interp). It can possibly be optimised a bit, but I’ve already done some stuff to make the cumulate fast.

Edit: small further speedup for the cumulate

garagecoder · February 20, 2021, 3:13pm

And finally a 64bit with bilinear interp (supports non-integer windows). This one is quite slow in comparison, so if anyone can think of ways to speed it up I’d be grateful!

gcd_boxfilter_local64_lerp : check ${"is_image_arg $1"}
  pass$1 0 max. 1
  repeat $!-1 l[$>,-1]
    f.. "
    begin(
      const boundary=1;
      ref(crop(#0),img);
      const W=w-1; const H=h-1;
      px(x,y) = (img[x + y*w#0 + c*wh#0]);
      qx(x,y) = (
        xA=floor(x);
        xB=ceil(x);
        yA=floor(y);
        yB=ceil(y);
        lx=x-xA;
        lerp(
          lerp(px(xA,yA),px(xB,yA),lx),
          lerp(px(xA,yB),px(xB,yB),lx),
          y-yA
        )
      );

      T(X,Y)=Y>1?qx(X,Y-1):Y*qx(X,0);
      B(X,Y)=Y<H?qx(X,Y):(1+Y-H)*qx(X,H)-(Y-H)*qx(X,H-1);
      G(X,Y)=X<W?B(X,Y):(1+X-W)*B(W,Y)-(X-W)*B(W-1,Y);
      D(X,Y)=X>1?T(X-1,Y):X*T(0,Y);
      E(X,Y)=X<W?T(X,Y):(1+X-W)*T(W,Y)-(X-W)*T(W-1,Y);
      F(X,Y)=X>1?B(X-1,Y):X*B(0,Y);

      for (C = 0; P = 0, C<s#0, ++C,
        for (X = 0; S = 0, X<w#0, ++X; ++P,
          S += img[P];
          img[P] = S;
        );
        for (Y = 1, Y<h#0, ++Y,
          for (X = 0; S = 0, X<w#0, ++X; ++P,
            S += img[P];
            img[P] = img[P-w#0]+S;
          )
        )
      )
    );
    R=i#1; S=R*R; K=(R-1)/2;
    x0 = x - K;
    y0 = y - K;
    x1 = x + K;
    y1 = y + K;
    (G(x1,y1) + D(x0,y0) - E(x1,y0) - F(x0,y1))/S"
  endl done rm.

Edit: I’ve also pushed this to g’mic community as gcd_boxfilter_local_hp for now. Most of the slowdown is due to the bilinear interp (it usually at least doubles the time taken).

afre · February 21, 2021, 12:12am

What does hp stand for? Looks like you are enjoying this exercise and have disposable time.

Another wrench: I wonder if a weighted filter is possible with SAT…

garagecoder · February 21, 2021, 8:58am

The “hp” stands for high precision (that’s mentioned in the help) - I thought it better not to include machine architecture in the name. The end goal is the good part, this time the writing of it was quite frustrating (boundary handling especially because I had to work that out)! I have some time in spots, hopefully more than last year…

If you mean e.g. a guassian weighted then I don’t believe so, without multiple iterations. I do think this filter would benefit hugely from being in a compiled language, then multiple iters would be less of a problem. We should probably wait to see how many uses it has before looking into that though. At the moment I have three in mind - variable blur, blend feathering, simulated depth blurring.

Edit: I found a nice way to improve precision is by subtracting image average at the beginning, then add back after. Will add that to the commands.

afre · February 21, 2021, 10:04am

So far we have an adaptive window (size) but not a weighted window (kernel). It doesn’t have to be Gaussian. I want the freedom to choose what it will be. (The kernel itself could be adaptive as well but that is another story.)

Yes, I have done that. I still don’t quite know

garagecoder · February 21, 2021, 10:14am

I’m fairly sure that means creating four different summed areas (although contained in the same buffer). In the standard implementation, the origin can be seen as the top or bottom left. Sums increase with x+y.

An alternative is to start from the image centre, with sum increasing with abs(x)+abs(y) with respect to the new origin (then the top left is actually location (-(w/2), -(h/2)). I’m not convinced that’s a good solution considering how much more difficult it is to handle the boundaries. I expect that to affect performance a lot!

garagecoder · February 21, 2021, 10:46am

@snibgo I read most of your article (thanks), but have a question about the boundaries. There’s a part which says:

Pixels near the edge are a pain. The window is effectively smaller near the edges, so we might divide by another image that represented the variable window area. This is easy in C code (eg the deintegim process module) but is awkward in a CLI command.

If you have time, would you mind explaining the maths of that a little more? If I understand correctly, it implies a smaller blurred area at the edges in the C code, rather than a Neumann boundary. Would that not have quite a great effect with larger window sizes?

snibgo · February 21, 2021, 1:36pm

SAT gives the same weight to every input pixel in the window. It is not possible to assign different weights. We can use multiple SATS, and we can (in principle) approximate other filters to any desired accuracy by a series of SATs each of some of appropriate window size and weight. More detail and references at Application: approximating other convolutions.

Yes, the blurred area is smaller at edges. The result is better than a Neumann boundary, which gives too much weight to edge pixels. In more detail:

Suppose we are using Integral Image to find windowed means. Suppose our window is 101x1 pixels, Width x Height, so each output pixel is the mean of 101 input pixels, which include 50 to the left and 50 to the right.

This is fine for pixels near the centre. But for pixels within 50 of an edge, what value do we want for the mean, and how do we calculate it? Two approaches are possible:

(a) Change widow sizes. So for pixels at left edge, the window is 51x1. We sum the values of 51 pixels, then divide the sum by 51. We ignore pixels to the left of the left edge, because there aren’t any. No special processing is needed to create the SAT. But we do need special processing when reading the SAT to get the sums or means.

(b) Keep a constant window size, 101x1 pixels. So we always sum the values of 101 input pixels, and always divide by 101, but some of the input pixels are invented. (ImageMagick calls these “virtual pixels”, see ImageMagick – Command-line Options. For GMIC, see https://gmic.eu/tutorial/images-have-edges.shtml , but “mirror” is also an option.) However we invent the values, the mean will be biased. For example, if we use “-virtual-pixel=edge”, GMIC “Neumann”, then the input pixel at the edge will have a massively disproportionate weight on the mean, eg 50 times the weight of any other pixel. The other problem with this approach is that the SAT must be built for a given window size.

I show two implementations for ImageMagick: (1) in Windows BAT scripts and (2) in process modules, written in C.

The BAT scripts are just for interest. For production use, I recommend the process modules.

BAT scripts cannot sensibly change window sizes between pixels, so they use approach (b). C code can easily change window sizes, so the process modules use approach (a).

We use process modules in pairs: “-process integim” builds the SAT, and is independent of window size. Then we use “-process deintegim” to get the sum or mean for the required window size.

The process modules, by default, pre-multiply values for the SAT by alpha, and post-divide by the sum of alpha. I think this is good behaviour, but it is “my own invention” so is open to comments/criticism.

Integral images has bad links to the C source code. They are acually at integim.c and deintegim.c.

A version of deintegim.c could easily be written that varied the window size according to a saliency map.

garagecoder · February 21, 2021, 1:50pm

That’s very helpful thanks. Seems like we should give the reduced window option a go at some point. For reference, here’s how the Neumann boundary is done in gcd_decumulate (I couldn’t think of something better, but I have a habit of missing obvious solutions): Summed Area Table (SAT) box filtering

Edit: the main thing I was trying to avoid there is building a SAT larger than the input image.

Edit2: that’s also with the goal of per-pixel windows (i.e. not separable to 1D)

snibgo · February 21, 2021, 3:06pm

Thanks for the link, @garagecoder. That helps my understanding.

On precision: yes, subtracting the image mean before calculating the SAT, then adding it after we use the SAT, should work well. The SAT will then contain negative values, which shouldn’t be a problem. We would need to carry that average colour forwards, of course. In ImageMagick terms, we would calculate the average in integim.c and store it in an image property, so deintegim.c can use it.

On non-integer window sizes, which we would need if sizes are modulated by a saliency map: If I understand @garagecoder correctly, when calculating the sum of a window sized 101.2x101.2, bilinear interpolation from four values is used at each corner of the window, to get the sum at that corner. Then those four values are used to calculate the overall sum of that window.

A simpler and faster method would calculate the sums of two windows, sized 101x101 and 102x102 [EDIT oops, I mean 103x103], then interpolate between those sums. Would this give the same result? I think it would, but I’m not sure. [EDIT: update: no, I was wrong, the result is generally different, so my simpler faster method is generally wrong.]

Incidentally, I can’t see any advantage from restricting the windows to be square.

afre · February 21, 2021, 8:44pm

Treatment of boundaries along 1D data: repeat-full (Neumann), repeat-half, symmetry-full (mirror), symmetry-half, inbound pixels only. Window size=9 (radius=4), which is extreme given the data is only 4 data points in length. Results in second cluster.

Remarks
1 Notice symmetry-full results in a reverse gradient.
2 2D data also have corners to consider.

snibgo · February 21, 2021, 11:05pm

The result would simply be the average of all the values in the image if r=4; i.e., 1.20 1.20 1.20 1.20.

For my process modules, yes, each output would be the mean of the four inputs. That mean is 0.15, not 0.12.

garagecoder · February 21, 2021, 11:44pm

I think we would need four values interpolation, especially if windows aren’t restricted to square. With a non-square window and non-integer sizes, the requested sum location could lie anywhere in the region between four sums. Will recheck the maths later.

Indeed the current g’mic filters could very easily use windows specified by two dimensions per pixel - the option is simply ignored for now. Thankfully the boundary handling described still applies to that and non-integer windows.

The main reason for manually implementing bilinear in the “high precision” version was to match the standard version, where we get bilinear interp more or less for free. In high precision we’re forced to implement an image buffer directly, rather than using a normal g’mic image in the stack.

snibgo · February 22, 2021, 12:38am

Ah, yes, that’s right, thanks. For non-square windows, we do need your method of interpolating at each corner.

garagecoder · February 22, 2021, 8:14pm

I thought about doing the two point linear interp for square window sums since it would be a nice saving, but actually I think four point is still needed. I was forgetting that the points along the diagonal in bilinear are quadratic. Of course it does not need to be specifically quadratic, but it shows we’re still interpolating four points (sums):
bilinear4x4

snibgo · February 22, 2021, 8:37pm

Yes. From your observation, I should have said: “Certainly for non-square windows, we do need your method of interpolating at each corner.”

For square windows, the situation is less obvious (to me). The required value for a corner lies on the diagonal of a pixel, which mislead me to think the value is determined only by the values at the extremes of that diagonal. But that is not the case.

However, for large windows and ordinary photos the difference between your accurate method of “lerf four points at each corner”, and my inaccurate “weighted average of two computed window sums” is probably small. Maybe too small to notice.

Or maybe I’m just trying to salvage my simple, elegant and fast but wrong solution.

afre · February 22, 2021, 11:31pm

I don’t think so but there are papers and patents that do do that. In general, there are a few main strategies:

– multiple SATs
– partial to complete SATs (tricks to calculate SATs with less computes)
– compute SATs in a way that suits hardware architecture

But as we have established, we might not have gains from these using G’MIC.

PS Example

github.com

andmax/gpufilter/blob/master/src/sat_gpu.cuh

/**
 *  @file sat_gpu.cuh
 *  @brief Algorithm SAT (Summed-area Tables) in the GPU with borders
 *  @author Andre Maximo
 *  @date Feb, 2012
 *  @copyright The MIT License
 */

#ifndef SAT_GPU_CUH
#define SAT_GPU_CUH

#define MST == // measure step time: no: == ; yes: >=
#define LDG // uncomment to use __ldg

//== NAMESPACES ================================================================

namespace gpufilter {

//== IMPLEMENTATION ============================================================

This file has been truncated. show original