Continuous Droste Floating Point Exception (Bug?)

grosgood · April 27, 2024, 9:40pm

In gmic-community/include/martin_souphead.gmic:

    90	_souphead_droste10 :
    91	  repeat $! l[$>]
    92	    to_a       # Add an alpha channel if not there
    93	    +f. 0      # Create blank image of same size
    94	    sh. 0      # Work with just 1 channel to save time
>>  95	    f. "*
    96	begin(InnerRadius = $1;
    97	     OuterRadius = $2;

that is the start of the fill command related math expression implementing the droste algorithm. Referring to Multi threaded and in place evaluation in the math expression technical reference, if, instead of * one of the single-threaded variants (>, +) is set here, would that affect the problem? Make it go away? Could be useful if the issue is thread-related in any way.

David_Tschumperle · April 27, 2024, 9:44pm

Thanks for the logs!
Looking at those, I don’t see anything really critical, like an invalid memory access or something similar that could explain the crash. There are some non-freed memory at the end, but this doesn’t even seem to come from G’MIC but from third-party libraries used by G’MIC.
And valgrind indeed takes a lot of time for analyzing & running the executable.

At this point, the crash did not happen, at least with the versions you’ve been able to compile.

ThioJoe · April 27, 2024, 10:04pm

You’re onto something for sure, I replaced the asterisk with a plus sign like this and now it seems to consistently work fine even with the huge image:

_souphead_droste10 :
  repeat $! l[$>]
    to_a       # Add an alpha channel if not there
    +f. 0      # Create blank image of same size
    sh. 0      # Work with just 1 channel to save time
    f. "+
begin(InnerRadius = $1;

Also seems to work if I use a greater than sign > instead of the asterisk.

Here’s a version of the gmic file with just the droste filter but with that single threaded modification:
DrosteSingleThread.gmic.txt (11.5 KB)

More Tests:

I tried compiling with more debug flags like this
$(MAKE) "CFLAGS+=$(GMIC_STD_CFLAGS) $(DEBUG_CFLAGS) -fsanitize=undefined -fno-omit-frame-pointer" "LIBS+=gmic.o $(GMIC_STD_LIBS) $(OPT_LIBS)" gmic.o _cli

And got some interesting output.

I ran it with 1000.png and it worked the first instance:

joe@joe-ubuntu-vm:~/Repos/gmic/src$ ./gmic -input 1000.png -souphead_droste10 40,100,1,1,1,0,0,0,0,0,1,10,1,0,90,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0 -output output.png
[gmic]./ Start G'MIC interpreter (v.3.3.6).
[gmic]./ Input file '1000.png' at position 0 (1 image 1000x1000x1x3).
[gmic]./ Output image [0] as png file 'output.png' (1 image 1000x1000x1x4).
[gmic]./ End G'MIC interpreter.

But when I ran it a second time the console printed out an enormous amount of this and filled up the terminal:

AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL

Then I ran it on the big image and it gave this:

joe@joe-ubuntu-vm:~/Repos/gmic/src$ ./gmic -input Super.png -souphead_droste10 40,100,1,1,1,0,0,0,0,0,1,10,1,0,90,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0 -output output-super.png
[gmic]./ Start G'MIC interpreter (v.3.3.6).
[gmic]./ Input file 'Super.png' at position 0 (1 image 6000x6000x1x4).AddressSanitizer:DEADLYSIGNAL
=================================================================
==8001==ERROR: AddressSanitizer: FPE on unknown address 0x58925ec164a3 (pc 0x58925ec164a3 bp 0x7e6f739ea6f0 sp 0x7e6f739ea6d0 T6)
    #0 0x58925ec164a3 in gmic_library::gmic_image<float>::_cimg_math_parser::mp_mul(gmic_library::gmic_image<float>::_cimg_math_parser&) /home/joe/Repos/gmic/src/CImg.h:27759
    #1 0x58925ec09107 in gmic_library::gmic_image<float>::_cimg_math_parser::mp_if(gmic_library::gmic_image<float>::_cimg_math_parser&) /home/joe/Repos/gmic/src/CImg.h:25945
    #2 0x58925e68c75b in gmic_library::gmic_image<float>::_cimg_math_parser::operator()(double, double, double, double) /home/joe/Repos/gmic/src/CImg.h:23640
    #3 0x58925f84679c in gmic_library::gmic_image<float>::_fill(char const*, bool, unsigned int, gmic_library::gmic_list<float>*, char const*, gmic_library::gmic_image<float> const*, gmic_library::gmic_image<double>*) [clone ._omp_fn.1] /home/joe/Repos/gmic/src/CImg.h:33842
    #4 0x7e6f7d03cc0d  (/lib/x86_64-linux-gnu/libgomp.so.1+0x1dc0d)
    #5 0x7e6f7ac94ac2 in start_thread nptl/pthread_create.c:442
    #6 0x7e6f7ad2684f  (/lib/x86_64-linux-gnu/libc.so.6+0x12684f)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: FPE /home/joe/Repos/gmic/src/CImg.h:27759 in gmic_library::gmic_image<float>::_cimg_math_parser::mp_mul(gmic_library::gmic_image<float>::_cimg_math_parser&)
Thread T6 created by T0 here:
    #0 0x7e6f7c658685 in __interceptor_pthread_create ../../../../src/libsanitizer/asan/asan_interceptors.cpp:216
    #1 0x7e6f7d03d25f  (/lib/x86_64-linux-gnu/libgomp.so.1+0x1e25f)
    #2 0x7e6f7d033a10 in GOMP_parallel (/lib/x86_64-linux-gnu/libgomp.so.1+0x14a10)
    #3 0x7ffdcfec1dbf  ([stack]+0x110dbf)

==8001==ABORTING

grosgood · April 27, 2024, 10:50pm

But running slower? A lot slower? Or does “work fine” also mean that processing time is about the same.

If you’ve read the cautionary accompanying these “threading hints” you’ll note that ‘*’ does not guarantee ‘faster!’ Maybe yes. Maybe no. Threading overhead can actually make multi-threading go slower. Mutex frenzy, perhaps.

If you do not notice a perceptible speed decrease, or do, but it is not painful, then perhaps changing that ‘*’ to a single thread model (+, >) is the way to go — an easy fix.

G’MIC has timing commands tic and toc. tic starts the timer. toc stops it and puts the results in a status string. echo ${} immediately following toc dumps the elapsed time to standard output. Could put tic just before f (fill) at line 95, then toc echo ${} at:

     528	ColorOutR*255"
>>   529	    k..  # Only second to last image should be kept
     530	  done done

just before k.. (keep). That could give you a notion of which variant is faster. Will try this myself, to satisfy my curiosity, but I have to walk Vinnie the dog first (it’s time).

This is how we turn people into G’MIC coders. You’re teetering on the lip of a slippery slope…

ThioJoe · April 27, 2024, 11:25pm

I just mean that it works in the sense that it actually finishes and doesn’t crash. Seems to be about the same speed for smaller images whether single or multi-threaded, but I can’t be sure about larger images because for those it crashes basically every time if it’s multithreaded. So at least it’s a workable solution for me.

I’m still going to experiment a bit with the debuggers.

I was able to catch it during a floating point exception with GDB after modifying CIMG.h a bit:

(gdb) catch signal SIGFPE
Catchpoint 1 (signal SIGFPE)
(gdb) run
Starting program: /home/joe/Repos/gmic/src/gmic -input Super.png -souphead_droste10 40,100,1,1,1,0,0,0,0,0,1,10,1,0,90,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0 -output output-super.png
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[gmic]./ Start G'MIC interpreter (v.3.3.6).
[gmic]./ Input file 'Super.png' at position 0 (1 image 6000x6000x1x4).[New Thread 0x7ffff1027640 (LWP 10417)]
[New Thread 0x7ffff0826640 (LWP 10418)]
[New Thread 0x7ffff0025640 (LWP 10419)]
[New Thread 0x7fffef824640 (LWP 10420)]
[New Thread 0x7fffef023640 (LWP 10421)]
[New Thread 0x7fffee822640 (LWP 10422)]
[New Thread 0x7fffee021640 (LWP 10423)]
[New Thread 0x7fffed820640 (LWP 10424)]
[New Thread 0x7fffed01f640 (LWP 10425)]
[New Thread 0x7fffec81e640 (LWP 10426)]
[New Thread 0x7fffec01d640 (LWP 10427)]
[New Thread 0x7fffeb81c640 (LWP 10428)]
[New Thread 0x7fffeb01b640 (LWP 10429)]
[New Thread 0x7fffea81a640 (LWP 10430)]
[New Thread 0x7fffea019640 (LWP 10431)]
[Switching to Thread 0x7fffea81a640 (LWP 10430)]

Thread 15 "gmic" hit Catchpoint 1 (signal SIGFPE), 0x0000555557c3d4ad in gmic_library::gmic_image<float>::_cimg_math_parser::mp_mul (mp=...) at /home/joe/Repos/gmic/src/CImg.h:27761
27761	      	double result = arg2*arg3;
(gdb) info locals
arg2 = 2.5
arg3 = -2.4492935982947064e-16
result = 5.2965016673825701e-310
(gdb) bt
#0  0x0000555557c3d4ad in gmic_library::gmic_image<float>::_cimg_math_parser::mp_mul (mp=...) at /home/joe/Repos/gmic/src/CImg.h:27761
#1  0x0000555557c30108 in gmic_library::gmic_image<float>::_cimg_math_parser::mp_if (mp=...) at /home/joe/Repos/gmic/src/CImg.h:25945
#2  0x00005555576b375c in gmic_library::gmic_image<float>::_cimg_math_parser::operator() (this=0x618000357880, x=5132, y=5286, z=0, c=0) at /home/joe/Repos/gmic/src/CImg.h:23640
#3  0x000055555886d7b1 in _ZN12gmic_library10gmic_imageIfE5_fillEPKcbjPNS_9gmic_listIfEES3_PKS1_PNS0_IdEE._omp_fn.1(void) () at /home/joe/Repos/gmic/src/CImg.h:33845
#4  0x00007ffff734bc0e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
#5  0x00007ffff5a94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#6  0x00007ffff5b26850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) list
27756	      }
27757	      
27758	      static double mp_mul(_cimg_math_parser& mp) {
27759	      	double arg2 = _mp_arg(2);
27760	      	double arg3 = _mp_arg(3);
27761	      	double result = arg2*arg3;
27762	        return result;
27763	      }
27764	
27765	      static double mp_mul2(_cimg_math_parser& mp) {
(gdb)

Since the actual answer to arg2*arg3 should be -6.123233995736766 × 10^-16 and not 5.2965016673825701 × 10^-310 something weird is going on. That’s a difference in nearly 300 orders of magnitude! I’m not very familiar with multi threaded debugging but maybe one of the other threads is messing with the values.

David_Tschumperle · April 28, 2024, 7:03am

That’s super interesting @ThioJoe , thank you!
I’m not at home right now, so it’s a bit hard for me to find time to investigate, but I will, next week.

Maybe this is related to the multi-threading processing of a shared image. This is not something filters do that often. Here are possible ways to explore:

Make the fill command works on a non-shared image instead, with multi-threading activated, to see if the problem comes from the shared nature of the processed image.
Try to play with environment variable OMP_NUM_THREADS to see if there is a limit on the number after which it crashes. Like:

$ OMP_NUM_THREADS=4 ./gmic (...)

(e.g. I expect this works just fine with OMP_NUM_THREADS=1).

David_Tschumperle · April 30, 2024, 7:52am

Hi @ThioJoe ,
I’ve modified the code of the Droste filter so that it does not use shared images anymore.
If you have 3.3.6 installed, could you please try:

$ gmic update

and then the command that makes G’MIC crash, to see if that modification has any effect ?
Thanks!

EDIT: I’ve also tested G’MIC exec on Windows (VM) with multiple CPUs enabled, but I don’t get any crash on this configuration either (I mean, with the previous code of the droste filter).

ThioJoe · April 30, 2024, 4:31pm

Oh no it still seems to do it

I might just have to do more investigating and debugging attempts on my side since it seems to be something to do with my machine.

I tried limiting the number of cores in the windows virtual machine (through the Hyper-V settings) and running the command and it seems when I limit it to 8 virtual cores or less, it works every time even with the 6000x6000 image. When I put it at 9 cores, the 1500x1500 one works every time, but the 6000x6000 image only works every once in a while, like 1/4 times. At 10 cores, the 6000x6000 fails every time, and the 1500px one failed the first time but worked subsequent times. All very odd.

David_Tschumperle · April 30, 2024, 4:50pm

Thanks for the tests !
Ah yes, that’s very odd indeed. At this stage, I don’t know what to do to help.
Do not hesitate to report other strange valgrind or sanitizer logs if you get some (particularly if it crashes while running).

Mushy · May 9, 2024, 4:19pm

Could it be a problem with BIOS settings?
Intel is apparently “…telling motherboard manufacturers to use its recommended BIOS settings by default to stop CPU instability issues with 13th and 14th Generation chips.”

Interesting recent article from El Reg:

David_Tschumperle · July 9, 2024, 12:36pm

@ThioJoe , does this bug still happens with latest release 3.4.0 of G’MIC ?

ThioJoe · July 11, 2024, 4:50pm

Yea still seems to happen. I think in the next couple days I’m going to try swapping out my CPU (13900KS) with my older 12900K because I’m really wondering if it’s not related to all the other growing stability issues that the 13900k and 14900K are having. I’ve also noticed odd behavior in other software as well. I’ll report back either way.

ThioJoe · July 11, 2024, 11:37pm

Actually another update, before swapping out the CPU I tried updating to the latest BIOS version of my motherboard (I previously had the 2nd most recent), and that actually fixed it. It doesn’t seem to crash even with very large images like 6000x6000. It even fixed it in 3.3.6!

I had also confirmed the problem was still there right before I updated the BIOS, because I was doing some testing before swapping out the CPUs anyway, but figured I’d give updating the BIOS a shot first. So I guess it contained a microcode update or something that fixed it.

Funnily enough, it looks like since I updated the BIOS this morning, there is actually an even newer version that literally just got released.

David_Tschumperle · July 12, 2024, 7:48am

Thanks for the update @ThioJoe , it’s a bit scary to see this kind of thing can happen. We could have spent a long time trying to understand what was going on on our side