PhotoFlow: new caching mechanism - TESTING NEEDED!

Carmelo_DrRaw · September 23, 2019, 11:44am

@afre @Silvio_Grosso I have added both a tooltip for the relight button, and a close button to the soft-proofing dialog. It is true that this was already mentioned in the past, but went over my head. Thanks for the reminder!

The fixes are committed to the stable branch only.

Carmelo_DrRaw · September 23, 2019, 4:35pm

@paulmiller @gadolf @afre I have introduced two changes, now that it is more or less clear that the issue is in the multi-threading part:

I have made all tile caches persistent, which means that tiles are never dropped at the end of the processing
I have added further refcount consistency check at the beginning of the tile processing, to see if the wrong refcounts are there from the very beginning or are introduced during the tile processing.

Once more, your feedback will be very much appreciated, because on my system I can scroll like hell and nothing bad happens!!!

The new MacOS package should be ready in a short while…

afre · September 23, 2019, 5:01pm

You should make a scroll_like_hell branch just for kicks. It isn’t only about scrolling rapidly or scrolling lots but also leaving PF open for a long time.

paulmiller · September 23, 2019, 5:22pm

Is that 1c461?

Its still crashing (but with a slightly different stack trace - there are a couple of extra vips_* calls in there):

Thread 19 Crashed:: worker
0   libsystem_kernel.dylib        	0x00007fff5989e2c6 __pthread_kill + 10
1   libsystem_pthread.dylib       	0x00007fff59959bf1 pthread_kill + 284
2   libsystem_c.dylib             	0x00007fff598086a6 abort + 127
3   libglib-2.0.0.dylib           	0x0000000108070688 g_assertion_message + 423
4   libglib-2.0.0.dylib           	0x00000001080706e6 g_assertion_message_expr + 94
5   photoflow                     	0x0000000105aef8fc phf_tile_cache_gen + 2172
6   libvips.42.dylib              	0x0000000107ca486d vips_region_prepare + 253
7   libvips.42.dylib              	0x0000000107bd8639 vips_embed_base_gen + 121
8   libvips.42.dylib              	0x0000000107ca486d vips_region_prepare + 253
9   photoflow                     	0x0000000105af6264 vips_layer_gen(_VipsRegion*, void*, void*, void*, int*) + 196

photoflow_log6.txt (13.9 KB)

gadolf · September 24, 2019, 11:17am

This is fast, even at 1:1.
I had this crash after inserting a gradient layer early into the pixel pipe (I probably shouldn’t, but I did it)
debug2.txt (128.3 KB)

paulmiller · September 24, 2019, 2:54pm

Just tried b9f2f:

It still crashes, but it took longer. I ran it with PHF_TILECACHE_DEBUG=1. The resulting log is 25MB!

photoflow_log7.txt.zip (1.5 MB)

Carmelo_DrRaw · September 25, 2019, 8:33am

@gadolf: there is nothing wrong with inserting a gradient layer anywhere in the layers stack, this should just work. However, the crash you observed does not seem to be related to tile caching, and would require running the program through gdb or valgrind in order to catch the exact place where crash occurs…

@paulmiller: I am happy to see that the crash is still there when using the verbose logging. Looking at the referencing/dereferencing history of the tile that generates the crash, I see the following:

phf_tile_ref: tile: 0x10f3c7338  ref_count: 1
phf_tile_ref: tile: 0x10f3c7338  ref_count: 2
phf_tile_ref: tile: 0x10f3c7338  ref_count: 3
phf_tile_unref: tile: 0x10f3c7338  ref_count: 3
phf_tile_unref: tile: 0x10f3c7338  ref_count: 2
phf_tile_unref: tile: 0x10f3c7338  ref_count: 1
phf_tile_ref: tile: 0x10f3c7338  ref_count: 1
phf_tile_ref: tile: 0x10f3c7338  ref_count: 2
phf_tile_unref: tile: 0x10f3c7338  ref_count: 2
phf_tile_ref: tile: 0x10f3c7338  ref_count: 3
phf_tile_unref: tile: 0x10f3c7338  ref_count: 1
phf_tile_cache_gen[1]: tile 0x10f3c7338 wrong ref_count 0, state=0, region=0x7facc62225a0, cache=0x7fac850ac970

In the last two lines on sees a jump in reference counting of 2 (from 3 to 1), which is not expected. However, I think we are in a good track to pin-down the issue.
If you don’t mind, I have prepared a new package with even more tile-related messages in PHF_TILECACHE_DEBUG=1 mode, hopefully it will let me figure out where exactly the mismatch is taking place…

Thanks!

I also got the impression that my recent changes made the preview even more responsive. Good to know that I am not the only one having this impression

paulmiller · September 25, 2019, 11:20am

Unfortunately, I haven’t been able to get 53faf to crash in PHF_TILECACHE_DEBUG mode yet. It still crashes if I turn that off. Maybe the extra printfs have disturbed the timing enough to fix the problem (does printf have an internal mutex?). I’ll keep trying.

Carmelo_DrRaw · September 25, 2019, 3:09pm

That is what I was afraid of… IN the version you have tested, each printf was followed by a flush of stdout in order to make sure to see the latest messages on the terminal. Maybe this is slowing down the processing too much, so I have prepared a new version without the flushing. Given that the crashes are not segfaults but are due to failed asserts, it should make no difference in terms of what is printed to the terminal.

I would suggest to give this new version a try. Also, what are your CPU specs? How many cores?

Thanks!

paulmiller · September 25, 2019, 3:20pm

2,6GHz i7, 6 cores (12 threads).

paulmiller · September 25, 2019, 4:11pm

3f04b is not cooperating either: it will crash with PHF_TILECACHE_DEBUG off, but not with it on.

Edit:
I managed to get 3f04b to crash:
photoflow_log8.txt.zip (6.1 MB)

Carmelo_DrRaw · September 26, 2019, 7:50am

I have an i5 2.9GHz with 2 cores, so this might explain why my system is not fast enough to trigger the crash.

I now suspect that the issue is due to concurrent increase/decrease of the tile reference counters (which seems to be confirmed by your latest log).

I have now protected the phf_tile_ref() and phf_tile_unref() functions with a global mutex, and also decreased again the verbosity level with PHF_TILECACHE_DEBUG=1 (you can still get more verbose output with PHF_TILECACHE_DEBUG=2).

As usual, I am not getting any crashes on my system…

The new package is ready for download.
Thanks!

paulmiller · September 26, 2019, 3:24pm

I’ve been running 9596d all day (about 45mins total) with PHF_TILECACHE_DEBUG=1, and it hasn’t crashed yet. Looks like you may have fixed the refCounting race condition

There are a few problems that still need looking at:

Some layers don’t update the image when you change some settings (e.g. ‘Gamut Warnings’ on the ColourSpace conversion layer)
The crop tool gets confused if you try to edit a crop (related to the above I think - you get the crop box displayed over the cropped image and offset)
The image ‘zoom to fit’ is not the same colour as at 1:1 or when exported. The preview looks more saturated (more obvious with some images than others).

Carmelo_DrRaw · September 26, 2019, 4:52pm

I really hope so! It has been a real PITA…
Did you also try without debugging messages?

I have just started to work on a proper fix for this. If you could help me catch the controls that do not update the image, it would be great. I have already taken note of the gamut warning and the crop tool

Could you provide me a good example? Might be due to the resizing algorithm that I am using…

Thanks!

paulmiller · September 27, 2019, 2:25pm

Here is an example of the colour shift I mentioned:

The exported image:

The view in PhotoFlow:

Note that the vegetation in the foreground looks reddish.

The view in Photoflow at 1:1 scale:

Note that the colour appears to match the exported jpg.

Here are the source files:
DSC_1018rl.pfi (42.1 KB)
DSC_1018.NEF (14.4 MB)
These files are made available under a CC BY-NC-SA 4.0 Licence.

Hope this helps.

age · September 27, 2019, 3:46pm

Probably the color shift is due to the CA correction in the raw developer module not updated in the “fit to screen” resize

Carmelo_DrRaw · September 27, 2019, 3:58pm

@paulmiller @age
I confirm the color shift, but I rather suspect the demosaicing. At zoom levels below 1:1 a simple fast demosaicing is used. With the past caching mechanism, the cached image was computed at 1:1 and the re-scaled, while not the caching happens at the zoom level requested by the preview. I need to find a proper fix for this, that is force the demosaicing at 1:1 and re-scale on the fly.

I will post an update once this is done…

gadolf · September 28, 2019, 2:07pm

Not a crash, but a freezing.
Follow gdb output.

phf.txt (115.7 KB)
EDIT: The freeze happened when I tried to move the linear range slider to the left, after moving it successfully to the right.

Carmelo_DrRaw · September 28, 2019, 8:47pm

As usual, I am not able to reproduce the freeze, but I am not very surprised… those seem to be quite subtle bugs, and difficult to reproduce.

Your gdb output shows however that the freeze was actually a consequence of a segmentation fault. However, the log only shows the latest function call in the stack, not the full backtrace. Next time please run the following command at the gdb prompt, to get the backtrace from all active threads:

thread apply all bt full

This is better described in the bug reporting documentation of RawTherapee: How to write useful bug reports - RawPedia
Most of what is described there also applies to PhF…

Thanks!

gadolf · September 28, 2019, 10:39pm

I did it and got a new freeze, although I was able to do more work this time.

phf.txt (669.6 KB)