Is my computer really this slow?

Hi all,

I have just played with Silvio’s photo DEHAZE-BUILDINGS.jpg (the original), which can be fetched at the bottom of his question here: How to dehaze pictures with dcp dehaze filter

Starting with G’MIC’s default settings for Details/Dcp dehaze, I just changed Output to New layer(s) and Gamma to 35.00, then clicked OK. The result was quite interesting, but my computer needed 51 seconds to complete the command.

Is my computer really that slow, or do you believe I have missed something in my setup in GIMP or in G’MIC?

G’MIC isn’t blindingly fast from what I can tell. Sharpening on my Core i5 box takes 10-15 seconds for me, and that computer isn’t slow.

Hi @paperdigits,
If you have the time, could you please perform the same operation as I noted above, using the same photo and time it on your machine?

Yup, just got my morning coffee, give me a few minutes :stuck_out_tongue_closed_eyes:

Hi Claes,

Thanks a lot for your help ! :slight_smile:

On my Computer:
Windows 7 - 64 bit
CPU: Intel i7-2630QM
GPU: Nvidia 540M
RAM: 8 GB

With Gamma at 35.00 it takes 45 seconds on my image.
Judging by the Task Manager panel (taskmgr), it looks like this specific filter is multithreaded, in that it uses all the cores of my i7 CPU while running.

As far as I am aware, G’MIC filters are unable to use the GPU (to perform faster).

On my computer:

NixOS 16.03 64-bit
CPU: AMD FX-8350 8 cores @ 4.0 GHz
RAM: 8 GB
GPU: ATI Radeon 8350

Default settings: 37 seconds
Gamma @ 35: 39 seconds

Thank you, @Silvio_Grosso and @paperdigits,
Looks like I am not too far off, then…

I am using an i7 920 @ 2.67 GHz with 12 GB RAM
and a lower-class GPU (but it is silent :slight_smile: !!!)

Another one:

Win7, Partha’s portable Gimp 2.9.5
i5 3570
HD 7850
16 GB RAM

31 seconds

Thank you, @Jacal
Hm… I wonder if we dare to ask @David_Tschumperle about his execution time?
Presumably, he is running something super, high-fidelity de-luxe…?

Well, I cannot say… as I’m not the author of this filter!
The author, Jérôme Boulanger, may have an idea, but I’m not sure he reads these forums.

Like @paperdigits, I also use an AMD FX-8350 8 cores @ 4 GHz. I don’t know anything about the implementation of Dcp dehaze, but I optimized the RT Retinex stuff some time ago. 39 seconds for an 8 MP file sounds like a basic implementation without optimizations. RT Retinex is at about 2 to 4 seconds (depending on settings) for a 36 MP file on the same CPU. Though I still don’t know whether I’m comparing apples to oranges here…

Ingo


What kind of wizardry do you use to keep churning out optimizations the way you do? :slight_smile:


It’s not wizardry. It’s

  1. finding the bottlenecks
  2. knowing some math rules
  3. knowing about cpu cache and memory access
  4. knowing how to write fast vectorized code
  5. knowing how to use OpenMP
  6. knowing that even small optimizations sum up if you can make a lot of them

but the main thing is: if I’m stuck somewhere in the optimization process, I go to bed, and the next morning I either know how to solve it or I know that I can’t optimize it any further (maybe that’s a kind of wizardry)

Edit: To help explain the process of optimization, I wrote some lines here in gamcurve_opt.txt. It doesn’t cover all the points I mentioned above (only 1, 2, 4 and 6).
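
For example, here is a toy illustration of points 2, 5 and 6 (heavily simplified, not the real RT code): the expensive per-pixel pow() is replaced by a lookup table computed once, and the remaining cheap loop is spread over all cores with OpenMP.

#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Naive: one expensive pow() per pixel.
void gamma_naive(std::uint16_t *data, std::size_t n, float g) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] = static_cast<std::uint16_t>(std::pow(data[i] / 65535.f, g) * 65535.f + 0.5f);
}

// Optimized: precompute the curve for all 65536 possible 16-bit input
// values, then each pixel is just a table lookup; OpenMP splits the
// loop across the available cores.
void gamma_lut(std::uint16_t *data, std::ptrdiff_t n, float g) {
    std::vector<std::uint16_t> lut(65536);
    for (int i = 0; i < 65536; ++i)
        lut[i] = static_cast<std::uint16_t>(std::pow(i / 65535.f, g) * 65535.f + 0.5f);

    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i)
        data[i] = lut[data[i]];
}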


Hi @David_Tschumperle,

No, I was not complaining about the speed of G’MIC,
I was only sorry that my computer seemed so slow :cry:

Then I thought: ah, David most certainly must have a real computer,
let’s ask him for that particular execution time…

I’ll be able to say how my computer performs tomorrow, at the lab :slight_smile:
Also, some reminders:

  • G’MIC filters are developed in the G’MIC script language, which is interpreted rather than compiled. This is of course one good reason why G’MIC execution time is almost always slower than an equivalent algorithm compiled in C/C++, for instance. On the other hand, that’s also why the 450+ filters in libgmic take less than 6 MB, and why we have so many filters (they are usually easier/shorter to develop).

  • The dcp dehaze filter in G’MIC is not the same as the Retinex filter. I can’t say precisely what algorithm is behind it (as I said earlier, I didn’t develop this particular filter), but its algorithmic complexity is maybe of a higher order than Retinex, so comparing the two algorithms is probably unfair too. From what I know, Retinex seems to be a quite basic algorithm in terms of complexity.

Maybe Jérôme could tell us a bit more about this :slight_smile:


For those interested: after looking at Jérôme’s code, it seems the dcp dehaze filter he implemented comes from this CVPR 2009 paper: http://research.microsoft.com/en-us/um/people/jiansun/papers/Dehaze_CVPR2009.pdf

I’ll probably ask him for more info this week.
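
For the curious: the core of the method in that paper is the “dark channel”: for each pixel, take the minimum of the three color channels, then the minimum over a surrounding patch. A quick, unoptimized sketch of just that step (my reading of the paper, not Jérôme’s actual implementation; assumes a planar float RGB image with values in [0,1]):

#include <algorithm>
#include <vector>

// Dark channel: per-pixel channel minimum, then a patch-wise minimum
// (a grey-level erosion). O(radius^2) per pixel; a real implementation
// would use a separable or van Herk/Gil-Werman min filter instead.
std::vector<float> dark_channel(const float *img, int w, int h, int radius) {
    const float *r = img, *g = img + w * h, *b = img + 2 * w * h;
    std::vector<float> cmin(w * h), dark(w * h);
    for (int i = 0; i < w * h; ++i)
        cmin[i] = std::min({ r[i], g[i], b[i] });
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float m = 1.f;
            for (int dy = -radius; dy <= radius; ++dy)
                for (int dx = -radius; dx <= radius; ++dx) {
                    const int yy = std::clamp(y + dy, 0, h - 1);
                    const int xx = std::clamp(x + dx, 0, w - 1);
                    m = std::min(m, cmin[yy * w + xx]);
                }
            dark[y * w + x] = m;
        }
    return dark;
}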

The time-consuming part most likely stems from the loop containing “-median 3”. Whether the transmission map can be obtained without it is perhaps the question…

If I read the G’MIC/CImg code correctly, the -median 3 applies a 3x3 median (median of 9 values) to the image.

  1. The parallelization is at channel level, which means it will not use more than c cores, where c is the number of channels in the image.

  2. I don’t know how the channel data is arranged in the CImg class, but if it is interleaved, e.g. RGBRGBRGB…, the parallelization will mostly be ineffective because of cache conflicts when writing to res. In the worst case, the parallelized version can even take a lot more time than a single-threaded version because of these cache conflicts.
    But even when using only one core, the CPU would read and write c times the amount of memory actually necessary.
    If the channel data is arranged as separate planes, point 1 is still valid.

  3. In RT we reviewed our median code some time ago, and the fastest we could get for a median of 9 values was this. Using it, we could reduce the time to median9 a float 36 MP file with 3 separate channels to about 210 ms (70 ms per channel), measured on the above-mentioned AMD FX-8350 8 cores @ 4 GHz.
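
The trick is a sorting network: the classic 19-exchange median-of-9 (Paeth/Devillard), written branchless with min/max so the compiler can vectorize it. Roughly like this (simplified; the real code differs in details):

#include <algorithm>

template<typename T>
inline void sort2(T &a, T &b) {           // one compare-exchange step
    const T t = std::min(a, b);
    b = std::max(a, b);
    a = t;
}

template<typename T>
T median9(T p0, T p1, T p2, T p3, T p4, T p5, T p6, T p7, T p8) {
    sort2(p1, p2); sort2(p4, p5); sort2(p7, p8);
    sort2(p0, p1); sort2(p3, p4); sort2(p6, p7);
    sort2(p1, p2); sort2(p4, p5); sort2(p7, p8);
    sort2(p0, p3); sort2(p5, p8); sort2(p4, p7);
    sort2(p3, p6); sort2(p1, p4); sort2(p2, p5);
    sort2(p4, p7); sort2(p4, p2); sort2(p6, p4);
    sort2(p4, p2);
    return p4;                            // p4 now holds the median
}

Since every call runs the same fixed sequence of min/max operations with no data-dependent branches, the compiler can process several pixels per instruction with SIMD.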


Right. Jérôme could then use the command ‘-apply_parallel_overlap’, which splits the median filter into N image blocks (where N is equal or close to the number of threads). Anyway, I’ve tested it quickly on a decent image (3000x2135) on my 4-core machine, and it appears it actually becomes a bit slower for the 3x3 median. This could perhaps become interesting for even larger images, but probably without a big difference:


Without the spatial splitting of the image:

$ gmic image.jpg -tic -median 3 -toc -q
[gmic]-0./ Start G'MIC interpreter.
[gmic]-0./ Input file 'image.jpg' at position 0 (1 image 3000x2135x1x3).
[gmic]-1./ Initialize timer.
[gmic]-1./ Apply median filter of size 3, on image [0].
[gmic]-1./toc/ Set status to '0.332'.
[gmic]-1./ Elapsed time: 0.332 s.
[gmic]-1./ Quit G'MIC interpreter.

With the spatial splitting (here, my machine has 4 cores):

$ gmic image.jpg -tic -apply_parallel_overlap \"-median 3\" -toc -q
[gmic]-0./ Start G'MIC interpreter.
[gmic]-0./ Input file 'image.jpg' at position 0 (1 image 3000x2135x1x3).
[gmic]-1./ Initialize timer.
[gmic]-1./ Apply parallelized command '-median 3' on image [0], with overlap 0 and 4 threads.
[gmic]-1./toc/ Set status to '0.449'.
[gmic]-1./ Elapsed time: 0.449 s.
[gmic]-1./ Quit G'MIC interpreter.

So, yes, with more cores you should get something better. Anyway, even 332 ms to apply the median filter on such an image doesn’t seem excessive. I’ve tested with a 36 MP image, and I get a computation time of 723 ms on my 4-core machine. Doesn’t sound so bad.

In CImg and G’MIC, image data are arranged as channel planes: RRRRRRRRRR…GGGGGGGGGG…BBBBBBB, so there is no interleaving of the channel data.
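
For illustration, a pixel (x,y) of channel c then sits at a simple planar offset (illustrative helper, not CImg’s actual accessor, which also handles a z/depth dimension):

#include <cstddef>

inline std::size_t planar_offset(std::size_t x, std::size_t y, std::size_t c,
                                 std::size_t width, std::size_t height) {
    // Channels are stored as independent planes, so threads working on
    // different channels never write to the same cache lines.
    return x + y * width + c * width * height;
}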

CImg also uses a special case for 3x3 median filtering. At this point, considering the computation times I get for the median filter (less than 1s for an image with a decent resolution), I don’t think the median filtering is the bottleneck of the dehaze algorithm.

@heckflosse, @David_Tschumperle
Thanks for the interesting insights. What I meant was that the entire loop may be where most of the time is spent; I didn’t mean to imply -median specifically was at fault. Perhaps that assumption is also wrong, and it needs proper testing. I hope I haven’t wasted too much of your time!