Is my computer really this slow?

I don’t know whether it’s the bottleneck or not, so I’ll trust your expertise here.

About the scaling: 0.332 with 4 cores is almost exactly what I would expect (0.449 * 3 / 4 = 0.33675). That means -apply_parallel_overlap could reduce this time even more when it can use more than 4 cores.

About the 723 ms for a 36 MP file: was that using 32-bit floating-point or 8-bit integer data? The 210 ms I referred to was for 36 MP of 32-bit floating-point data. With the branch-free median9 on 8-bit integer data I would expect the processing time to drop by ~3/4 (to ~60 ms), though I can’t test the 8-bit integer case in RT. But I know that gcc 5.x generates nicely vectorized code for the 32-bit floating-point median9 case (at -O3 at least). And for 8-bit data there are also _mm_min_epu8 and _mm_max_epu8, which the compiler could use to process 16 values at once.
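For reference, a minimal sketch of what I mean by a branch-free median9 (an illustration based on the classic 19-stage compare-exchange network, not RT’s exact code): every stage is a std::min/std::max pair, so there are no branches and the compiler can turn the whole thing into straight-line vector code (minps/maxps for float, or _mm_min_epu8/_mm_max_epu8 for 8-bit data).

```cpp
#include <algorithm>

// Branch-free median of 9 values via a compare-exchange network.
// Each cs(a, b) leaves a = min, b = max, with no branch at all.
template<typename T>
inline T median9(T p0, T p1, T p2, T p3, T p4, T p5, T p6, T p7, T p8)
{
    #define cs(a, b) { T t = std::min(a, b); b = std::max(a, b); a = t; }
    cs(p1, p2) cs(p4, p5) cs(p7, p8)
    cs(p0, p1) cs(p3, p4) cs(p6, p7)
    cs(p1, p2) cs(p4, p5) cs(p7, p8)
    cs(p0, p3) cs(p5, p8) cs(p4, p7)
    cs(p3, p6) cs(p1, p4) cs(p2, p5)
    cs(p4, p7) cs(p4, p2) cs(p6, p4)
    cs(p4, p2)
    #undef cs
    return p4; // the median ends up in p4
}
```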

You’re right. I misunderstood that. But no problem. You didn’t waste my time :slight_smile:

Yes, that was with float32 pixel values.
Compiled only with -O2 and -mtune=generic with g++-5.4.

Ok, thank you :slight_smile:

Switching circular to standard -erode/dilate seems to produce the biggest speed gains for me, but only as a demonstration; there are visible differences so it isn’t as simple as that.

A little more can be squeezed by limiting memory copies (i.e. use shared buffers etc.), but nothing like as significant.
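In CImg terms the shared-buffer idea looks roughly like this (a minimal sketch assuming CImg’s get_shared_channel(); not the actual G’MIC script change):

```cpp
#include "CImg.h"
using namespace cimg_library;

// get_shared_channel() returns a view over the same pixel memory, so the
// operation happens in place - no temporary image is allocated or copied.
void scale_red_inplace(CImg<float>& img) {
    img.get_shared_channel(0) /= 255.0f;
}
```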

@garagecoder @David_Tschumperle

For me the dehaze algorithm in G’MIC was (and is) not the point of interest.

But speeding up core algorithms (like median, Gaussian blur, etc.) could give speedups for a lot of filters, I guess.

Looking at the CImg code (which is C++), we could try to use the fastest implementations in both RT and G’MIC. I’m not saying that RT always has the fastest implementations, but I know that some parts are very fast (median and Gaussian blur, for example), and for other parts G’MIC may have faster and better ones.

I just had a short look at the CImg code. It looks like there is a lot of potential to speed things up there. In fact it reminds me of RT some years ago, when nothing had really been optimized and RT was really slow.

One example: why does CImg still use pow(x, 1.0/3) to calculate a cube root when std::cbrt(x) is available and faster too? I know they are not the same, as pow(x, 1.0/3) does not work for negative x.
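A minimal illustration of the difference:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // std::pow with a fractional exponent is a domain error for negative
    // bases and yields NaN, while std::cbrt handles them correctly:
    std::printf("pow(-8, 1/3) = %f\n", std::pow(-8.0, 1.0 / 3.0)); // nan
    std::printf("cbrt(-8)     = %f\n", std::cbrt(-8.0));           // -2.0
    return 0;
}
```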

I just cloned CImg and offer my help to optimize CImg performance (which afaik is the basis for G’MIC performance) by making pull requests, if you don’t mind @David_Tschumperle :wink:

Ingo


Indeed.

I wouldn’t be so confident claiming that, at least not before I see some noticeable optimizations.
CImg has been around for a long time (since 1999) and has already been optimized quite a lot. There’s probably slightly more to do, but not that much IMHO (without using GPU-specific instructions, of course).

Because std::cbrt() was introduced in C++11, and CImg is intended to stay backward compatible with older versions of the standard (you can’t imagine how many people still try to compile projects with VC 6.0 nowadays, and complain that CImg no longer supports it).

Of course, you are welcome to do so, but please try to provide benchmarks that show the effect of a change along with each patch, and keep in mind that CImg does not apply only to color images, and that it must still compile under the C++98 standard, etc. (e.g. adding C++11-specific calls must be done within dedicated #ifdef... #endif blocks).
I’m waiting for your patches.
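To illustrate that last point, a guard along these lines keeps a C++11 call out of C++98 builds (a sketch using the standard __cplusplus macro; fast_cbrt is a hypothetical name, not actual CImg code):

```cpp
#include <cmath>

// Use std::cbrt when compiling as C++11 or later, and fall back to
// pow-based code (with a sign fix for negative inputs) under C++98.
inline double fast_cbrt(const double x) {
#if __cplusplus >= 201103L
    return std::cbrt(x);
#else
    return x < 0 ? -std::pow(-x, 1.0 / 3) : std::pow(x, 1.0 / 3);
#endif
}
```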

Thanks.

Maybe you could try the octagonal dilate/erode. It’s between square and circular!

Hello hello,

David told me that my little DCP dehaze filter triggered a passionate discussion on CImg optimization. What I can say is that the library has been optimized for years and David has always been very keen and very responsive about integrating patches. Compilers are also quite good nowadays at auto-vectorization and approximate math, so hand optimization is often quite a disappointing job to do.

The DCP dehaze filter was a quick draft to see how this works and is based on the following paper
K. He, J. Sun, and X. Tang, “Single Image Haze Removal Using Dark Channel Prior,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2341–2353, Dec. 2011.

It makes a “funny” assumption that most natural images have a “dark channel” and uses it to recover a haze-free image from an estimated transmission map. So no, it is not a Retinex-like approach.
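To make the prior concrete: in most haze-free outdoor images, every local patch contains some pixels that are dark in at least one color channel. A minimal CImg-flavored sketch of the dark channel computation (an illustration under that assumption, not the filter’s actual code):

```cpp
#include "CImg.h"
using namespace cimg_library;

// Dark channel of an RGB image: per-pixel minimum over the color
// channels, followed by a patch-wise minimum (a grayscale erosion).
CImg<float> dark_channel(const CImg<float>& img, const unsigned int patch_size = 15) {
    CImg<float> dark = img.get_channel(0);
    cimg_forC(img, c) dark.min(img.get_shared_channel(c)); // pointwise min (c = 0 is a no-op)
    return dark.erode(patch_size); // min over a patch_size x patch_size neighborhood
}
```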

It is very possible that the code of the filter itself is responsible for the lack of performance. If you feel it is necessary, we can spend a bit of time on it.

Jerome

ps: David you got me!

@KaRo
Sadly the octagonal versions use the same method as the circular one: a mask supplied for each patch. I did think about a custom circle erode based on Bresenham to calculate the endpoints of each row of the patch (you just keep adding a simple differential), but it’s debatable whether that would end up faster than a mask - CPUs tend to like a nice buffer to loop over. Certainly there’s no way it would be faster in the G’MIC math processor, because -erode with a mask is a native command.
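For what it’s worth, a sketch of the row-endpoint idea (disc_row_halfwidths is an illustrative name, not existing code): precompute the half-width of every row of a radius-r disc with an incremental integer test, then erode each row over [x - w, x + w] instead of applying a mask.

```cpp
#include <vector>

// For each row dy of a radius-r disc, find the largest x with
// x*x + dy*dy <= r*r. The half-width only shrinks as |dy| grows, so the
// inner while loop costs O(r) in total - no sqrt, no per-patch mask.
std::vector<int> disc_row_halfwidths(const int r) {
    std::vector<int> w(2 * r + 1);
    int x = r;
    for (int dy = 0; dy <= r; ++dy) {
        while (x > 0 && x * x + dy * dy > r * r) --x; // step inward
        w[r + dy] = w[r - dy] = x;                    // symmetric rows
    }
    return w;
}
```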

@Jerome_Boulanger
Nice to see you on here :slight_smile:
Indeed nearly every time I’ve thought something about G’MIC core or CImg can be made faster, I eventually realise it can’t :smiley:

Here is a Chinese blog with information about the “dark channel prior”; it also points to the same paper Jérôme Boulanger mentioned above.

(Use Google Translate or something similar.)

Some people here might find this interesting?

Hi again,

Just wanted to add that the implementation is loosely based on the article, since I didn’t use the soft matting step; I replaced it with a simple median filter.
S. Lee, S. Yun, J.-H. Nam, C. S. Won, and S.-W. Jung, “A review on dark channel prior based image dehazing algorithms,” EURASIP Journal on Image and Video Processing, vol. 2016, no. 1, Dec. 2016.

Other, more recent approaches are also available and use assumptions other than the DCP.

Jerome

Ok, here’s a first quick and dirty patch. I just copied over some code from RT for the median of 9 values and used it in blur_median.
I benchmarked it using gmic image.jpg -tic -median 3 -toc -q where image.jpg is a 36 MP file.
Processing time on a 4-core machine (median of 7 runs):

before patch: 894 ms
after patch: 237 ms

Edit: Here is a patch which also includes median of 25 values.

I benchmarked it using gmic image.jpg -tic -median 5 -toc -q where image.jpg is a 36 MP file.
Processing time on a 4-core machine (median of 7 runs):

before patch: 7479 ms
after patch: 1330 ms


Just a note: though the code is C++11, it can easily be rewritten in C++98 with a slightly different interface. There is no C++11 magic about it.


Ok, I’m currently looking at your patch, which indeed makes things faster.
The surprise is: the cause of the speed gain is not the algorithm itself, but mainly the use of std::min() and std::max() instead of my ‘own’ min() and max() functions. It looks like the compiler uses hard-coded functions for computing the min() and max() of two float values. If I use my own min/max functions in your fastmedian() function, I get results very similar to my previous code. So I’m currently patching my min()/max() functions to make them use std::min()/std::max() when possible. Not sure how I can enable this for C++98 users, by the way.
I’ll let you know when this is ready.

I guess the compiler can’t vectorize your ‘own’ min and max functions, but it can vectorize std::min() and std::max(), at least for float values.
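A hedged illustration of the kind of loop where this shows up (not CImg’s actual source): written with std::min/std::max, gcc -O3 will usually turn it into packed minps/maxps, because each iteration is straight-line, branch-free dataflow; a hand-written compare-and-branch version may keep branches instead and defeat the vectorizer.

```cpp
#include <algorithm>
#include <cstddef>

// Clamp every element of a float buffer into [lo, hi]; the min/max pair
// typically compiles to packed minps/maxps with gcc -O3.
void clamp_buffer(float* data, std::size_t n, const float lo, const float hi) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] = std::max(lo, std::min(data[i], hi));
}
```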

Instead of passing a std::array<>, pass the parameters by value, or have templated functions like fastmedian9(T*) where the argument is a pointer to nine T’s (reminds me of the 90s - the programming style as well :grin:).

That is what I’ve done.
Anyway, it seems the optimization flags are not optimal. I compile G’MIC with -O3 -mtune=generic, and in this case my min()/max() functions are slower. If I use -Ofast, they become equivalent to std::min()/std::max() (I’ve looked at the generated assembly code to compare the two versions).

I don’t see any problem with that coding style. Most of the best coders started coding in the 90s :slight_smile:
There are so many people advocating fancy, “modern” syntax who do not realize that the assembly code generated by the compiler is the same in the end (or sometimes even worse).
No need to be pedantic with good old programmers.

I read it as a nostalgic comment rather than a criticism.
Anyway, I’m excited about any speedup, regardless of how it’s done :wink: