Dedicated "input" thread and buffering

Hi,

Browsing the sources, I saw that registration.c uses I/O functions from sequence.c inside parallel OpenMP loops. The I/O calls (read(), actually) can therefore be concurrent too, which might hurt performance in most cases.

Wouldn’t it be interesting to dedicate one more thread to reading frames, buffering them (e.g. keeping 32 frames in a buffer), and letting the computing threads pull from that buffer (waiting for a new frame when it is empty)?

I think an I/O thread in such a setup could run concurrently with OpenMP’s max_threads workers. The current situation on n cores is n computations plus potentially n dense I/O streams, whereas the dedicated-thread solution would obviously give 1 I/O thread (full throughput when frames are located on a single disk) plus n computational ones. System load and overhead would surely be lower.
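
To illustrate the pattern I have in mind, here is a rough pthreads sketch (frame_t, read_frame() and process_frame() are placeholders, not actual Siril functions):

```c
/* Rough sketch of the proposed pattern: one dedicated reader thread
 * fills a bounded buffer, several computing threads consume from it.
 * frame_t, read_frame() and process_frame() are placeholders, not
 * actual Siril functions. */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define BUF_SIZE 32                     /* e.g. keep 32 frames in buffer */

typedef struct frame frame_t;
frame_t *read_frame(int index);         /* placeholder: one disk read */
void process_frame(frame_t *frame);     /* placeholder: registration */

static frame_t *buffer[BUF_SIZE];
static int head, tail, count;
static bool done;                       /* reader finished the sequence */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

void *reader_thread(void *arg) {
    int nb_frames = *(int *)arg;
    for (int i = 0; i < nb_frames; i++) {
        frame_t *f = read_frame(i);     /* the only thread doing I/O */
        pthread_mutex_lock(&lock);
        while (count == BUF_SIZE)       /* buffer full: wait */
            pthread_cond_wait(&not_full, &lock);
        buffer[tail] = f;
        tail = (tail + 1) % BUF_SIZE;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }
    pthread_mutex_lock(&lock);
    done = true;                        /* wake consumers so they can exit */
    pthread_cond_broadcast(&not_empty);
    pthread_mutex_unlock(&lock);
    return NULL;
}

void *compute_thread(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == 0 && !done)     /* buffer empty: wait for a frame */
            pthread_cond_wait(&not_empty, &lock);
        if (count == 0) {               /* empty and done: no more work */
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        frame_t *f = buffer[head];
        head = (head + 1) % BUF_SIZE;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
        process_frame(f);
    }
}
```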

What’s your opinion?

Hi Frédéric, thanks for your contributions and messages.

It would surely be interesting to create a dedicated reading thread, but it would make the code much more complex and we have limited time to work on Siril. It’s in fact a long-standing idea (see this ticket), and it’s a rather long task whose benefits are uncertain:

  1. It’s already quite fast, so the time spent writing the dedicated-thread version may be wasted if it’s not better
  2. Sometimes reading in parallel is faster; it’s less of a problem with SSDs than with HDDs I guess. You say it’s slower in most cases; I don’t know, but even if that’s true, by how much? Maybe it doesn’t change that much
  3. It would have been a good way to limit wear on HDDs, but it’s a bit late for those
  4. In most cases we are limited by I/O, not processing power, and I don’t feel this would change that
  5. We are also often limited by memory, and keeping many images in memory while waiting for a thread to become available to process them is not efficient in those conditions. It adds a level of complexity to computing how many images and threads can be used.

Be aware that we did something similar for SER and FITSEQ sequence processing, but for writing: a dedicated thread takes the images to write and does all the buffer management and writing, see io/seqwriter.c.
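
In spirit, the writing side boils down to a single thread blocking on a GLib async queue; a minimal sketch of the idea (not the actual seqwriter API; image_t and write_image() are placeholders):

```c
/* The gist of it: a single dedicated writer thread blocking on a GLib
 * async queue. This is the spirit of io/seqwriter.c, not its actual
 * API; image_t and write_image() are placeholders. */
#include <glib.h>

typedef struct {
    int index;                /* plus pixel data, metadata... */
} image_t;

void write_image(image_t *img);   /* placeholder: the actual disk write */

static GAsyncQueue *write_queue;  /* created with g_async_queue_new() */
static image_t end_marker;        /* sentinel telling the thread to stop */

gpointer writer_thread(gpointer data) {
    (void)data;
    for (;;) {
        /* blocks until a processing thread pushes an image */
        image_t *img = g_async_queue_pop(write_queue);
        if (img == &end_marker)
            break;
        write_image(img);     /* only this thread touches the output file */
    }
    return NULL;
}
```

Processing threads just push their finished images with g_async_queue_push(write_queue, img) and never block on the disk.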

Feel free to try it of course, that would be a nice thing to evaluate.

Well, to clarify this, I ran some simple tests on my SSDs with varying numbers of parallel reads on large files.

My findings are that SSDs handle parallel reads quite well, scaling almost linearly in fact… up to a point where the overall throughput of the parallel readers drops to about 50% of a single reader’s (note that my SSDs got a x4 or x5 throughput boost, which is consistent with your point 2).
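
For reference, the kind of test is essentially this (sketched in C here, though a Python script would do the same job; paths and block size are arbitrary):

```c
/* Minimal parallel-read benchmark: one thread per file, each reading
 * sequentially in 1 MiB blocks, with the whole run timed. Use files
 * larger than RAM (or drop caches) to measure the disk, not the cache. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLOCK (1 << 20)

static void *reader(void *path) {
    char *buf = malloc(BLOCK);
    int fd = open(path, O_RDONLY);
    if (fd >= 0) {
        while (read(fd, buf, BLOCK) > 0)
            ;                           /* just pull the data through */
        close(fd);
    }
    free(buf);
    return NULL;
}

int main(int argc, char **argv) {
    int n = argc - 1;                   /* usage: ./bench file1 file2 ... */
    pthread_t *tids = malloc(n * sizeof(pthread_t));
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++)
        pthread_create(&tids[i], NULL, reader, argv[i + 1]);
    for (int i = 0; i < n; i++)
        pthread_join(tids[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d parallel reader(s): %.2f s\n", n, secs);
    free(tids);
    return 0;
}
```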

Maybe we should try to mimic Siril’s non-sequential parallel reads more precisely on more hardware, to see whether an optimization would be relevant, and to figure out whether the whole process could be optimized (there is no point in higher I/O performance if it slows down the whole computation).

I can offer to write a short Python script based on the mock-up I used for KOMBAT.

Thanks for the test. I think that over a long run the threads will not all be reading at the same time; the reads will be distributed across the other threads’ processing time, because reading and processing times differ.

I can run some tests with your script if you make one.

Hm, I’ve thought about this a bit more, and now I suspect such a script couldn’t mimic the whole process well enough to draw conclusions here.

Further tests on a few SER files indicated that the average time spent in the I/O section varies from 50 to 70% of the total time (with KOMBAT, at least).

I also spotted that I get the same total time whether I use 4 or 8 threads with global alignment or KOMBAT (I assume it is the same with the other algorithms). Since more compute threads do not reduce the total time, we are I/O-bound there, which indicates that an optimization can be achieved.

So, I will try to modify KOMBAT a bit to implement something that could then be generalized. At this point, here’s my broad view (a rough sketch follows the list):

  • A pool of g_threads, with an associated shared buffer, will execute slightly modified alignment threads. They will take at least one boolean argument toggling whether a thread acts as a pure I/O thread or a pure computing one,

  • The buffer will act as a size-limited queue (capped at roughly the number of cores, in order to keep memory usage comparable to what we currently have),

  • I/O threads will feed the buffer while computing threads consume data from it,

  • An extra managing thread will then balance the numbers of I/O and computing threads in order to maximize overall throughput during alignment (once the maximum is reached, I suppose the manager will just go idle).
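
Something like this, using GLib (all names here are hypothetical, and a real GAsyncQueue would need an explicit size cap to act as the bounded buffer):

```c
/* Rough shape of the plan: each GThread gets a role and either feeds
 * the shared queue or consumes from it; a manager could later change
 * the split. All names here are hypothetical. */
#include <glib.h>

typedef enum { ROLE_IO, ROLE_COMPUTE } thread_role;

typedef struct {
    thread_role role;        /* the boolean-like toggle mentioned above */
    GAsyncQueue *queue;      /* shared buffer between the two roles */
} thread_ctx;

void io_loop(thread_ctx *ctx);       /* placeholder: read frames, push them */
void compute_loop(thread_ctx *ctx);  /* placeholder: pop frames, register them */

static gpointer worker(gpointer data) {
    thread_ctx *ctx = data;
    if (ctx->role == ROLE_IO)
        io_loop(ctx);
    else
        compute_loop(ctx);
    return NULL;
}

void start_pool(int n_io, int n_compute, GAsyncQueue *queue) {
    for (int i = 0; i < n_io + n_compute; i++) {
        thread_ctx *ctx = g_new0(thread_ctx, 1);
        ctx->role = (i < n_io) ? ROLE_IO : ROLE_COMPUTE;
        ctx->queue = queue;
        g_thread_new("align-worker", worker, ctx);
    }
    /* the managing thread would watch the queue length here and adjust
     * the split: more readers when it runs dry, more compute threads
     * when it stays full */
}
```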

The balance between I/O and computing will change a lot from one computer to another.

Make sure you have enough memory to actually use 8 threads; Siril may limit you to 4 even if you set 8, because memory is the limiting factor. It should print this at the beginning of every generic sequence operation.

Indeed, you probably need a thread pool to do that. For writing it’s easier in fact: a single thread blocking on the async queue was enough. If you haven’t looked at seqwriter, keep this in mind: if you allocate an image in the buffer, it will be used by the read thread first, passed to a compute thread, then either written directly or passed to the seqwriter; only when all of that is done can you reuse it, so you need synchronization for the buffer.
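
To picture that constraint, each buffer slot effectively goes through a small state machine, and reuse is only safe once it is back to free; a minimal sketch (not the actual seqwriter code):

```c
/* Sketch of that lifecycle as a per-slot state machine: a slot can only
 * be reused once the whole read -> compute -> write chain has released
 * it. Illustration of the constraint, not the seqwriter code. */
#include <pthread.h>

typedef enum {
    SLOT_FREE,      /* may be (re)allocated for a new image */
    SLOT_READING,   /* owned by the read thread */
    SLOT_COMPUTING, /* owned by a compute thread */
    SLOT_WRITING    /* owned by the writer / seqwriter */
} slot_state;

typedef struct {
    slot_state state;       /* protected by lock */
    pthread_mutex_t lock;   /* init with pthread_mutex_init() */
    pthread_cond_t freed;   /* signaled when state goes back to FREE */
} slot_t;

/* read thread: block until the slot has completed its full cycle */
void slot_acquire_for_read(slot_t *s) {
    pthread_mutex_lock(&s->lock);
    while (s->state != SLOT_FREE)
        pthread_cond_wait(&s->freed, &s->lock);
    s->state = SLOT_READING;
    pthread_mutex_unlock(&s->lock);
}

/* called by the last user of the image, after it has been written */
void slot_release(slot_t *s) {
    pthread_mutex_lock(&s->lock);
    s->state = SLOT_FREE;
    pthread_cond_signal(&s->freed);
    pthread_mutex_unlock(&s->lock);
}
```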

But I agree your solution could be better. In fact, in the past weeks I’ve worked on threading the per-image operations, for cases where you have fewer images than available threads, or not enough memory to create as many image-processing threads as hardware threads. Your scheduler could indeed balance I/O and computing.

A hard part is making this work in a generic way for all processing functions (calibration and registration at least, because stacking has its own specific threading implementation), like the current generic function does.

Looking forward to seeing that! Thanks!

The I/O vs. computing balance will change from one computer to another, but also from one SER file to another, for example.

This is not a problem, as I plan to make an adaptive algorithm, which should (theoretically) handle this.

Using 8 threads is not a problem either: we can have thousands of threads (which are just lightweight processes from the OS’s perspective). I didn’t check whether Siril limits me to 4 threads, as I can see all 8 of my cores being used.

I think we should extend the principle of seqwriter: a single I/O thread wouldn’t be optimal here (judging from the figures I got on my PCs, at least). The buffer will of course be synchronized (mutexes).

I think I’ll build a PoC first and test it with KOMBAT (which I obviously know better). Then, if the results are good, I’ll generalize it to the other alignment methods. Finally, this could be brought to calibration or stacking. On that note, I want to emphasize that, in some way, registration also has a specific threading implementation, through OpenMP; I call it “specific” because it is not optimal for this application (see the non-linear relation between the number of threads and I/O efficiency on the disk sub-system).

To understand better how Siril threading works (leaving the extra seqwriter aside): we have a number of threads running that I call image-processing threads, because each one processes one image; we therefore cannot have more threads than images, and we limit the number of threads so that the images, and the allocations required to process them, fit in memory. Since a few weeks ago there are also the per-image threads; that’s what you could change at runtime with your scheduler. For example, a single image-processing thread (which is created by OpenMP in the generic function) can have two nested threads to speed up the computation of its image.
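
A simplified picture of those two levels (not the actual generic function):

```c
/* Simplified picture of the two levels described above: OpenMP
 * image-processing threads outside, per-image threads nested inside
 * each of them. Not the actual generic function. */
#include <omp.h>

void process_sequence(int nb_images, int proc_threads, int per_image_threads)
{
    omp_set_max_active_levels(2);      /* allow one level of nesting */

    /* each iteration = one image-processing thread working on one image */
    #pragma omp parallel for num_threads(proc_threads) schedule(dynamic)
    for (int i = 0; i < nb_images; i++) {
        /* ...which can itself spawn a few per-image threads to speed
         * up the computation on that single image */
        #pragma omp parallel num_threads(per_image_threads)
        {
            /* per-image work split here */
        }
    }
}
```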

Good luck :slight_smile:

Sorry, I was away the whole past week, so I didn’t work on this much.
I hope I’ll have more time in the coming days; my first tests were quite disappointing, and I don’t really get why… More investigation is needed at this point.

I’m back on this topic, with “something”.

Basically, the idea is to replace the OpenMP threads with pthreads, each assigned a “type”; the current types are default, read and compute (and maybe “idle” later, if this PoC is convincing enough).

Default mimics the standard behavior, where each thread reads one image and registers it.
“Read” threads only focus on reading sequences, while “compute” threads are currently assigned to registration.

Well, this only works with KOMBAT registration for now, but I think it could be adapted later, still depending on the benefits: the global algorithm is more complex, which will be something to consider.

Here’s the current interface:
[Screenshot from 2023-06-14 of the thread configuration dialog]

We can then choose among various profiles; here, a “manual configuration” is defined. I’ve made an automatic one too, but it doesn’t take all parameters into account yet. For each type (default/read/compute), we can define:

  • the number of threads of that type,
  • the number of images read at each execution (e.g. 1.25 means “always read at least one image, then read 1 more image 25% of the time”; see the sketch after this list),
  • the number of images to be computed (same logic).
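
To make the fractional counts concrete, this is the interpretation described above as a tiny helper (the function name is mine; g_random_double() is GLib’s RNG):

```c
/* The fractional counts interpreted literally: always handle floor(n)
 * images, plus one more with probability equal to the fractional part.
 * The helper name is mine; g_random_double() is GLib's RNG in [0, 1). */
#include <glib.h>
#include <math.h>

int images_this_round(double per_exec) {     /* e.g. 1.25 */
    int n = (int)floor(per_exec);            /* at least 1 image here */
    if (g_random_double() < per_exec - n)    /* +1 image 25% of the time */
        n++;
    return n;
}
```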

A “cache” for images waiting to be handled has been set up.

In manual mode, the user can drag and drop threads from the first column, “Number”, to change a thread’s type.

The goal is to control I/O usage efficiently so that it matches the disk sub-system, preventing overhead and, hopefully, giving a performance bonus.

For now, I get a 5% boost vs 1.2.0-beta3, but I guess my SSD already handles these access patterns quite well. I’ll run more tests soon.

Also, I haven’t committed yet because I haven’t reorganized the code, and I’d need more time to focus on testing and validating this approach (at least on all my hardware).

I’ll keep you informed.