G'MIC stability under OSX

I thought it would be better to move this outside of the PhotoFlow GUI discussion…

Continuing the discussion from PhotoFlow got a GUI relooking!:

Hello Andrea,

No, it doesn’t give any clue at a first glance. Most of the filters that fail are multi-threaded, but most of the filters that succeed are too O_o.

Maybe memory allocation? Are some filters creating/destroying intermediate images?

Remember that in the PhotoFlow case the parallel processing is not handled by G’MIC itself: I compile G’MIC without “is_parallel”, and the fftw3 stuff is also not parallel. I then feed tiles to the G’MIC filters in parallel… needless to say, this works perfectly under linux and with all my other custom filters under OS X.

I just had a quick look, and VIPS uses g_malloc() instead of plain malloc() for memory allocation… maybe Glib helps to keep OS X peculiarities under control?

I don’t think I use malloc() at all. I’m using the C++ new operator, which is probably thread-safe by nature
(otherwise, I guess no C++ code can be thread-safe :slight_smile: ).

I actually found few calls to malloc(), which I replaced with g_malloc() with no success…

However I’m still digging into the problem. Here is a stack trace of a typical crash I’m getting with the RL deconvolution filter:

Process 12675 stopped
* thread #10: tid = 0x1badd, 0x0000000101327574 photoflow`gmic& gmic::_run<float>     (this=0x00000001155209a0, commands_line=0x0000000113e38898,  position=0x0000000113e3884c, images=0x0000000113e123e0,  images_names=0x0000000113e123d0, parent_images=0x0000000113e38850,  parent_images_names=0x0000000113e38860, variables_sizes=0x0000000107146a00,  is_noarg=0x0000000113e38e54) + 3332 at gmic.cpp:4387, name = 'worker', stop reason =  EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x0000000101327574 photoflow`gmic& gmic::_run<float>(this=0x00000001155209a0, commands_line=0x0000000113e38898, position=0x0000000113e3884c, images=0x0000000113e123e0, images_names=0x0000000113e123d0, parent_images=0x0000000113e38850, parent_images_names=0x0000000113e38860, variables_sizes=0x0000000107146a00, is_noarg=0x0000000113e38e54) + 3332 at gmic.cpp:4387
   4384	      // Process debug info.
   4385	      if (next_debug_line!=~0U) { debug_line = next_debug_line; next_debug_line = ~0U; }
   4386	      if (next_debug_filename!=~0U) { debug_filename = next_debug_filename; next_debug_filename = ~0U; }
-> 4387	      while (position<commands_line.size() && *commands_line[position]==1) {
   4388	        if (cimg_sscanf(commands_line[position].data() + 1,"%x,%x",&_debug_line,&(_debug_filename=0))>0) {
   4389	          is_debug_info = true; debug_line = _debug_line; debug_filename = _debug_filename;
   4390	        }
(lldb) print position
(unsigned int) $0 = 16

This reminds me of a strange problem I had a while ago when I started to compile and test photoflow under OSX: in my “curves” filter I was storing the pre-computed curve points into std::vector objects, and then I was accessing the vector elements simultaneously from multiple threads.

This was causing strange and unpredictable crashes in the curves code, always at the lines where I was either accessing the vector elements with the [] operator, or calling the size() function. Needless to say that the very same code worked perfectly OK under Linux.

In a rush of frustration I have replaced the std::vector objects with simple arrays of values, and now the code does not crash anymore.

This however points to the fact that the access to std containers might not be thread-safe under OSX, even when only reading values. I’ll try to write a simple test case to check if my suspicions are correct. Could this be a source of troubles in G’MIC?

1 Like

Well, I don’t use std:: container at all in the G’MIC code, but I use my own containers.
I believe this is somehow related yes. From the crash experiments done by Karsten, the log files produced by gdb could be interpreted as the fact that a thread cannot read memory allocated from another thread. This is really strange, as this is actually known to be something valid to do (maybe even POSIX I think).
If you had the same kind of issues, that may be a way to follow.
(but if this is really the cause, I don’t see any possible easy fix for this…).

I can hardly believe that reading memory is not safe… and I have a lot of places in my code where this actually takes place without any issues.

Do you still have some examples of gdb logs from Karsten? If yes, I could try to see if I find some common denominators…

By the way, my crash occurs in this statement:

*commands_line[position]==1

I’m wondering what this statement exactly means:

*commands_line[position]

I’ve searched the CImg.h header for an overloaded operator*(), but I couldn’t find any.

Moreover, the crash seems to be due to the use of a NULL pointer:

stop reason = EXC_BAD_ACCESS (code=1, address=0x0)

could it be that in this specific case, command_line[position] with position=16 is not initialised or invalid?

Here, commands_line is actually a CImgList<char> which can be seen as a ‘list of strings’ (see it as a kind of std::vector<char*>). It’s actually the sequence of G’MIC code items to read and execute by the interpreter. It should of course never contain a (null) string. If it does, then yes, there is undoubtly a problem somewhere !

Do all your crashes occur always at the same place for the same instruction ?
I don’t see why commands_line could become corrupted with (null) strings, especially because it is passed as a const reference to the main parsing function _run() !

I’ve still a doubt concerning what this instruction does exactly:

*commands_line[position]==1

It can be expressed as “take the CImg object at “position” in the vector and then deference the corresponding CImg”. Actually I would like to insert some debugging printouts in the dereferencing operation, but I could not find where the overloading is implemented in CImg.h
I’ve only found this:

//! Implicitely cast an image into a \c T*.
operator T*() {
  return _data;
}

but that’s not a dereference overloading…

Also, here is another stack track, this time after a crash in the “gradient norm” filter:

Process 501 stopped
* thread #10: tid = 0x1d37, 0x00000001013267b2 photoflow`gmic& gmic::_run<float>(this=0x0000000112ecbbf0, commands_line=0x000000011296e538, position=0x000000011296e4ec, images=0x0000000112948080, images_names=0x0000000112948070, parent_images=0x000000011296e4f0, parent_images_names=0x000000011296e500, variables_sizes=0x00000001065a5200, is_noarg=0x000000011296eaf4) + 130 at gmic.cpp:4318, name = 'worker', stop reason = EXC_BAD_ACCESS (code=2, address=0x1128eff18)
frame #0: 0x00000001013267b2 photoflow`gmic& gmic::_run<float>(this=0x0000000112ecbbf0, commands_line=0x000000011296e538, position=0x000000011296e4ec, images=0x0000000112948080, images_names=0x0000000112948070, parent_images=0x000000011296e4f0, parent_images_names=0x000000011296e500, variables_sizes=0x00000001065a5200, is_noarg=0x000000011296eaf4) + 130 at gmic.cpp:4318
   4315	#if cimg_display!=0
   4316	  CImgDisplay *const _display_window = (CImgDisplay*)display_window;
   4317	#endif
-> 4318	  if (!commands_line || position>=commands_line._width) {
   4319	    if (is_debug) debug(images,"Return from empty scope '%s/'.",
   4320	                        scope.back().data());
   4321	    return *this;
(lldb) print commands_line._width
(const unsigned int) $0 = 27
(lldb) print !commands_line
error: Execution was interrupted, reason: EXC_BAD_ACCESS (code=2, address=0x1128efe90).
The process has been returned to the state before expression evaluation.
(lldb) print commands_line
(const cimg_library::CImgList<char>) $1 = {
  _width = 27
  _allocated_width = 32
  _data = 0x000000010657f208
}
(lldb) print &commands_line
(const cimg_library::CImgList<char> *) $2 = 0x000000011296e538

In this case, the faulty instruction is !commands_line.

Again I was not able to find an overloading of the !() operator in the CImgList code… did I just missed it? If yes, at which line is it defined?

Thanks!

*commands_line[position] means “the first char of the code item in current position”.
position can be seen as a program counter in assembler. It gives the indice of the current G’MIC item to execute. There are special debug items which are not regular strings, but char* buffers starting with the value 1.
Debug items give info about the current line number and the command file associated to the current G’MIC command to be executed. The interpreter tries then first to ‘skip’ these particular items before starting to execute the next command. This explain the term *commands_line[position]==1 in the while loop.

No, your stack trace looks interesting. It would be nice to see what is the value of the variable position in that case, as well as the value of the image at commands_line[position]. Could you try then to show me:

print commands_line
print position
print commands_line[position]

to see if there is something strange ? Thanks by advance for your help !

Something seems to be badly wrong there:

   4385	      if (next_debug_line!=~0U) { debug_line = next_debug_line; next_debug_line = ~0U; }
   4386	      if (next_debug_filename!=~0U) { debug_filename = next_debug_filename; next_debug_filename = ~0U; }
   4387	      while (position<commands_line.size() &&
-> 4388	          *commands_line[position]==1) {
   4389	        if (cimg_sscanf(commands_line[position].data() + 1,"%x,%x",&_debug_line,&(_debug_filename=0))>0) {
   4390	          is_debug_info = true; debug_line = _debug_line; debug_filename = _debug_filename;
   4391	        }
(lldb) print commands_line
(const cimg_library::CImgList<char>) $0 = {
  _width = 331608000
  _allocated_width = 1
  _data = 0x0000000113c3eec0
}
(lldb) print position
(unsigned int) $1 = 16
(lldb) print commands_line[position]
(const cimg_library::CImg<char>) $2 = (_width = 0, _height = 0, _depth = 0, _spectrum = 0, _is_shared = false, _data = 0x0000000000000000)

Hope this helps!

Thanks !
So, yes commands_line seems to be corrupted here.
That is very strange, as it is actually a const reference, so very few chances it gets modified from one thread to another.
At least, this gives very interesting info about the issue.
Maybe you could try to add a line like this:

std::fprintf(stderr,"\nDEBUG : position = %u, commands_line :\n",position);
commands_line.print("DEBUG");

just after the main while() of the _run() function (around line 4433 in current git version), to see when commands_line starts to be something else ?

Shouldn’t that be

std::fprintf(stderr,"\nDEBUG : position = %u, commands_line :\n", position);
commands_line.print("DEBUG");

?

@Carmelo_DrRaw: * and ! are not overloaded C++ operators but good old C code. * dereferences a pointer and ! negates a boolean (or any integer value in C).

Yes exactly, I’ve updated the post, thank you.

Well, I’ve digged a bit more in the CImg code to better understand what happens, and actually the behaviour is not that simple.

The instruction

commands_line[position]

triggers an implicit call to this overloaded operator in CImgList:

operator const CImg<T>*() {
  return _data;
}

and then the non-overloaded operator is used to access the element at “position”, which returns a reference to a CImg object.

Next, when the * operator is applied to the CImg object, this in turn triggers an implicit call to this overloaded operator in CImg:

operator const T*() {
  return _data;
}

Finally, the standard operator * is applied to the returned T* pointer, to access the first element of the _data array.

The only potentially dangerous thing that I see in this implicit casts is that the validity of the indices is not checked. But this does not mean the code is wrong (and since it runs under Linux, I’m certain that it is OK)!

Unfortunately I think that the failure in the parsing of the command line is only a symptom, but not the cause of the corruptions. Actually, as soon as I slightly change the code in gmic.cpp and recompile, the crashes move into another place…

… but I’m not giving up!!!

Yes, in CImgList<T>and CImg<T> I use implicit casts operators to respectively CImg<T>* and T*.
In my case, a CImgList<char> is used to store an array of strings, each CImg<char>being seen as a string.
Having these cast operators allow me to consider CImg<char> directly as char*buffers, which is very convenient, since I can manipulate a CImg<char> as a regular C-string, with the flexibility of having a nice memory management (easy-reallocation, no new/delete to handle, automatic deallocation of the strings through the destructor ~CImg<T>, etc…).
This is something I’ve already done a lot, and as an image CImg<T> image is basically a T*buffer plus some info about the width, height, depth, ..., it really makes sense to do so.

That is why I’ve told in a previous post that a CImgList<char> can be seen as a something equivalent to a std::vector(char*), it’s semantically the same thing for me.

It perfectly makes sense for me as well… I’m also quite convinced that the problem is not in this part of the code, but I was wondering where I could put some additional printf’s to see what was going on internally.

Maybe the brute force solution would be to compile everything with GNU gcc? Seems to be possible: http://stackoverflow.com/questions/21173518/install-gnu-gcc-on-mac

I don’t think the issue is related to the use of clang++ or g++.
I’ve tried to compile G’MIC on Linux with clang++, and I get no errors at all.

I think the suggestion I’ve made in my previous post would greatly help, i.e.
insert:

std::fprintf(stderr,"\nDEBUG : position = %u, commands_line :\n",position);
commands_line.print("DEBUG"); 

at line 4432 of file ‘gmic.cpp’ (at current git state).
And then, try to run the gradient norm filter for instance. It will show the position and commands_line state at each G’MIC item executed. If you add -debug in the command line you send to G’MIC, there is high chance we can see where commands_line becomes garbage, and this would probably tell us a bit more about the issue.