Not many CPUs used for two processes, anything I can do?

Hi, I noticed that when running an OSC_Processing script that I modified to process more than 2048 files with the segment below, it now only processes one file at a time:

# Convert Light Frames to .fit files
cd lights
convert light -out=…/process -fitseq
cd …/process

# Pre-process Light Frames
preprocess light -bias=bias_stacked -flat=pp_flat_stacked -cfa -equalize_cfa -debayer

# Align lights
# First doing: register pp_light -2pass -noout
register pp_light -2pass -noout

# Now doing: seqapplyreg pp_light -framing=max
seqapplyreg pp_light -framing=max

Here is the CPU usage and console screenshots during preprocess:

And here it is during Registration:

Is it because the lights are now in a big single sequence file that it’s slow?

It shouldn’t be much slower than before. What you’re showing are the steps where the output is being written, after all computation has already been done. Processing should still happen in parallel, and faster than before, because writes are deferred like this. Of course, if you’re using a rotating drive you will see this for a long time, but I don’t think that’s your case.

You are right, I have a fast NVMe SSD (about 3.4GB/sec), but for some reason it’s very slow and spends all of its time writing out the files at only 25-75 MB/second while using 6-30% of the CPU power.

I’m so used to seeing it dispatch many files/threads in parallel before, but with the FITS images in a single big file, it now seems to spend the vast majority of its time in the very slow writing phase, writing one file at a time.

It’s now applying the registration and writing the files out at about 1 file every 3.5 seconds. I’m happy it’s now working with more than 2048 files though, but it seems to be related to using “-fitseq”.

Yes, it’s related to the use of the single-file format, but it should not be that slow!
Can you confirm the disk supports fast writes with other applications, like a large file copy?
Images of these dimensions are still quite large, 523MB per image. I’ll need to understand what’s happening for it to be so slow if the disk is as fast as you said.

I’m stacking with different options and will record the performance and CPU/DISK usage. I’ll also do some raw disk tests on very large files and see if there is something about that.

OK, I did a big stack of 3000 subs without -framing=max, without FITS compression, and with only single-pass registration. It’s very fast and uses all the CPUs. So I now need to vary the configuration and see which option triggers the slowdown and the use of only a couple of CPUs.

Well, I couldn’t replicate it with a different data set… used the “-fitseq” option, “-framing=max”, and turned on RICE compression and it all worked and used all the CPUs. Maybe I just had an anomaly.

I did notice that not all files appeared to be compressed in the process folder, primarily the largest file. Here are the largest files in the folder representing 41.4GB of processed lights:

you also have r_pp_light.fit.fz, it probably happened when you tried different things…
I’m glad you’re not able to reproduce it now, but it’s still strange in the first place :slight_smile:

I’ll run it again, I tried to be diligent about removing the process directory between tests.

You are right, I had failed to clear the process directory between two tests. The second run had only compressed .fit.fz files. Nut loose on the keyboard!

OK, I’m back with more instances where SIRIL doesn’t use many CPUs and goes very slowly when using “-fitseq” with a lot of files. I think I’m finally zeroing in on the specifics of the use case, but there does seem to be some variability in whether it happens.

Below, I’ll copy the script I use, but first the description:

  • 2516 OSC lights each 4992x3340 in resolution, raw FITS captured with NINA. Each a short 6 second exposure using a dual narrowband filter
  • 100 bias frames
  • 25 flats
  • no darks

Computer: 8-core Ryzen 9 5900HX with 64 GB RAM and a nearly empty 2 TB NVMe disk with 3.0 GB/second transfer speed.

  1. If I turn off FITS compression, it all goes quickly and stacks in 52 minutes
  2. If I turn on FITS RICE compression with quantization level 16.00 and 32-bit floating point, it takes 2 hours and 47 minutes. In this case it was about 3.2x the time. I’ve had instances with larger (26 megapixel) image files where it sometimes takes nearly 5x the time.

Here are the times for the major steps between the two:

Step             No Compression    With RICE Compression
Preprocess       11 minutes        77 minutes
Register         16 minutes        64 minutes
Normalization     7 minutes         6 minutes
Stack            22 minutes        20 minutes

In all cases many CPUs are pretty fully used, with two exceptions: Preprocess and Register with FITS RICE compression, where it appears that only 1-2 CPUs are fully used.
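To put numbers on where the extra time goes, here is a small Python sketch that uses only the step timings listed above. The dictionary layout and function name are illustrative, not anything from Siril.

```python
# Step timings in minutes, copied from the table above:
# (no compression, with RICE compression)
timings = {
    "Preprocess": (11, 77),
    "Register": (16, 64),
    "Normalization": (7, 6),
    "Stack": (22, 20),
}


def slowdown(step):
    """Return how many times longer a step took with RICE compression."""
    plain, rice = timings[step]
    return rice / plain


for step in timings:
    print(f"{step:14s} {slowdown(step):.1f}x")

# Extra minutes spent in the two steps that write a new image sequence.
extra_minutes = sum(
    timings[step][1] - timings[step][0] for step in ("Preprocess", "Register")
)
print("extra minutes in Preprocess + Register:", extra_minutes)
```

The slowdown is confined to the two steps that produce a new image sequence; Normalization and Stack are essentially unchanged.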

Here is a CPU graph and a shot of the status window when it’s going slow:

image

Here is a CPU graph when it’s going fast:

image

When it’s going fast, you can clearly see the parallel nature in the status window, with many files in process at once and many completions per second. When it’s going slow, it seems to be one at a time.

The Script:

requires 0.99.4

# Convert Bias Frames to .fit files
cd biases
convert bias -out=…/process
cd …/process

# Stack Bias Frames to bias_stacked.fit
stack bias rej 3 3 -nonorm
cd …

# Convert Flat Frames to .fit files
cd flats
convert flat -out=…/process
cd …/process

# Pre-process Flat Frames
preprocess flat -bias=bias_stacked

# Stack Flat Frames to pp_flat_stacked.fit
stack pp_flat rej 3 3 -norm=mul
cd …

# Convert Light Frames to .fit files
cd lights
convert light -out=…/process -fitseq
cd …/process

# Pre-process Light Frames
preprocess light -bias=bias_stacked -flat=pp_flat_stacked -cfa -equalize_cfa -debayer

# Align lights
register pp_light

# Stack calibrated lights to result.fit
stack r_pp_light rej 3 3 -norm=addscale -output_norm -out=…/result

cd …
close

Thanks for the tests and numbers! I think I understand what is happening:

  • Using single-file sequences like SER or FITSEQ in Siril requires images to be written to the file in order, so a dedicated writing thread is created to handle that for operations that produce images, like calibration and registration.
  • The compression is transparent to Siril and managed by the FITS library. This means the writing thread does not only do I/O but also processing: the compression. So instead of writing one image and then the next right after it, it is compressing, writing, compressing, writing, and a lot of time is lost there. Unfortunately we cannot compress the image before passing it to the writing thread.

That is a limitation of how compression works in cfitsio, combined with the fact that a single-file sequence is created. I don’t see a way around this.
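As a rough illustration of why one in-order writer thread becomes the bottleneck once it also has to compress, here is a toy Python model. It is not Siril's implementation; the function name and all per-frame costs are hypothetical and chosen only to show the effect.

```python
def estimate_minutes(n_frames, compute_s, compress_s, write_s, workers):
    """Toy model of a worker pool feeding one in-order writer thread.

    Workers process frames in parallel; the single writer handles one frame
    at a time, compressing it (if enabled) and then writing it. The slower
    of the two stages bounds the total time. All per-frame costs are in
    seconds and are hypothetical inputs.
    """
    parallel_stage = n_frames * compute_s / workers
    writer_stage = n_frames * (compress_s + write_s)
    return max(parallel_stage, writer_stage) / 60.0


# 2516 frames and 8 cores come from the thread; the costs are made up.
frames, workers = 2516, 8
no_compression = estimate_minutes(
    frames, compute_s=3.0, compress_s=0.0, write_s=0.1, workers=workers
)
with_compression = estimate_minutes(
    frames, compute_s=3.0, compress_s=1.5, write_s=0.05, workers=workers
)
print(f"no compression:   {no_compression:.1f} min")
print(f"with compression: {with_compression:.1f} min")
```

In this model, once the writer stage dominates, adding more workers does not shorten the run, which is consistent with only 1-2 CPUs being busy during the slow phases.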

Thanks for thinking about this. Your explanation makes perfect sense and matches the behavior of my test cases very well. If I do fewer than 2048 files with no -fitseq, it’s fast with and without compression.

It’s a bummer because I shoot a lot of short exposures and I’ve been doing some longer integrations (more than 2048 files) recently, and the combination of SIRIL’s blazing fast speed, high quality stacking results, and FITS compression keeping the potential temporary disk explosion down was just killer for me.

So now I either do:

  1. Just accept the non-parallel nature of long imaging runs and significantly longer stacking times with compression.

  2. Buy some super massive and super fast SSD drive. Right now 4TB is the biggest, which, if completely empty, is probably enough for any current job. With the 12:1 expansion in data, this is good enough for about 300GB of captures or about 6000 lights.

  3. Test if stacking in chunks (say 1000 lights at a time) and then stacking all those results together generates similar results. Limited testing in the past indicates you get a loss in quality.

  4. Make some tradeoffs in my exposure time or FOV or resolution: Longer exposures trading off star/tracking/field-rotation quality to keep the number of subs below 2048. Or use the sub-frame exposure when the additional FOV isn’t needed to keep capture size down. Or even bin2 if I know I’ve oversampled substantially.

  5. Or wish for an improved FITS library that is parallel even with FITS compression

Any other possibilities you can think of?

you could also use regular FITS files, one per image, depending on the limits of your operating system

You could also force images to stay in 16 bits. That would already halve the current size, and for short exposures I don’t think you would gain a lot from using 32 bits.

Good idea, I’ll give the 16 bits a try, but I think I may want the final stacked image to be in 32-bits. Is there a way to do the registration and calibration in 16-bits but in the final stacking, let it go to 32-bits?

With respect to open files, I’m on Windows 10 and it only allows 2048 open files. Are there OS options that allow more open files? Mac OS, Linux, Windows 11?

OK, I tried 16 bits and I could detect very small differences, but no general advantage of the 32-bit float results over the 16-bit integer results. They were 98% the same; in some areas the 16-bit had a very slight advantage and in others the 32-bit float. I did a lot of the comparisons after running the same level of BlurXTerminator and NoiseXTerminator because that is how I process them anyway. Again, very little difference. This reduced the intermediate data size by 50% and results in a more modest ~6x expansion on the original data size compared to 12x. Great find!

Performance was even faster, down to 41 minutes to stack 2,500 16-megapixel OSC lights. That is a pretty amazing speed!
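For a rough sense of what the smaller expansion factor means for disk planning, here is a short Python sketch. It assumes a completely empty 4 TB drive (the size mentioned earlier in the thread), decimal gigabytes, and the approximate 12x and 6x expansion factors quoted above; the function name is illustrative.

```python
def capture_budget_gb(disk_gb, expansion):
    """Raw capture size (GB) whose intermediate files would fill the disk."""
    return disk_gb / expansion


disk_gb = 4000  # the 4 TB drive mentioned earlier, assumed completely empty

budget_32bit = capture_budget_gb(disk_gb, 12)  # ~12x expansion, 32-bit float
budget_16bit = capture_budget_gb(disk_gb, 6)   # ~6x expansion, 16-bit

print(f"32-bit intermediates: about {budget_32bit:.0f} GB of captures")
print(f"16-bit intermediates: about {budget_16bit:.0f} GB of captures")
```

Halving the expansion factor doubles the amount of raw capture data that fits on the same drive.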

That was the default for years in Siril, until 32-bit support was added in 1.0.
You can also change the script so that only the stacking is done in 32 bits, by adding the command set16bits at the beginning and set32bits before stacking. You could instead do it manually by changing the value in preferences before stacking, but that’s a good way to forget about it next time…

And apparently on Windows the limit can be increased to 8192 open files.
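To show where the two commands would go, here is a small Python sketch that inserts set16bits before the first command and set32bits right before the final light-stacking line of a script. The helper and its final_stack_prefix parameter are hypothetical, written only for this illustration; the example script is an abbreviated copy of the light-frame part of the script posted above.

```python
def add_bit_depth_switch(script, final_stack_prefix="stack r_pp_light"):
    """Return the script text with set16bits near the top and set32bits
    right before the final light-stacking command.

    final_stack_prefix is a hypothetical parameter for this sketch; it
    identifies the stacking line by how it starts.
    """
    out = []
    placed_16 = False
    for line in script.splitlines():
        stripped = line.strip()
        # Put set16bits before the first real command (skip blank lines,
        # comments and the leading "requires" line).
        if (not placed_16 and stripped and not stripped.startswith("#")
                and not stripped.startswith("requires")):
            out.append("set16bits")
            placed_16 = True
        # Switch back to 32 bits just before the final stack.
        if stripped.startswith(final_stack_prefix):
            out.append("set32bits")
        out.append(line)
    return "\n".join(out)


example = """requires 0.99.4
cd lights
convert light -out=…/process -fitseq
cd …/process
preprocess light -bias=bias_stacked -flat=pp_flat_stacked -cfa -equalize_cfa -debayer
register pp_light
stack r_pp_light rej 3 3 -norm=addscale -output_norm -out=…/result
"""

print(add_bit_depth_switch(example))
```

With this placement, conversion, calibration and registration write 16-bit intermediates, and only the final stack is allowed to produce 32-bit output.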

This is a great idea, as I think it’s during the stacking that you want the additional precision. I’ll have to look up that command. I assume set32bits implies floating point?

correct

If you haven’t seen it already, command reference is here: Commands — Siril 1.2.0 documentation