I’ve done this with ImageMagick. (Yes, honestly!) It is simple, automated, and fast.
The process for filming is: start the cameras and the digital recorder, then clap my hands where all devices can hear me. (A clapper-board would also work.)
In post, a script converts the audio recordings to images. Using image-processing techniques, it isolates the clap in each image and lines the images up on it. This gives the offsets needed to align all the audio sources. For convenience, the script adds silence to all tracks after the first so that, using Audacity or similar, I can simply left-align the tracks.
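To make the offset-and-padding step concrete, here is a minimal Python sketch. It is not my ImageMagick script: it assumes the clap is simply the loudest sample in each track, and the names `clap_index` and `align_tracks` are made up for illustration. It also pads every track so the claps coincide, rather than padding relative to the first track.

```python
def clap_index(samples):
    """Return the index of the loudest sample, assumed here to be the clap."""
    return max(range(len(samples)), key=lambda i: abs(samples[i]))


def align_tracks(tracks):
    """Prepend silence (zeros) so every track's clap lands on the same index."""
    claps = [clap_index(t) for t in tracks]
    latest = max(claps)
    # A track whose clap comes early gets more leading silence.
    return [[0] * (latest - c) + list(t) for t, c in zip(tracks, claps)]


# Toy data: the recorder was started two samples before the camera.
camera = [0, 1, 0, 90, 5, 2]
recorder = [0, 0, 1, 0, 2, 95, 4]
aligned = align_tracks([camera, recorder])
```

After this, both lists in `aligned` have the clap at the same index, so they can be left-aligned in an editor. Real recordings would need something more robust than "loudest sample", for example if a door slams later in the take.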
EDIT to add: the images made from the audio are N pixels wide and one pixel high, where N is a large number. Pixel “brightness” is audio sample amplitude, mono not stereo. These are not waveform images. See Sound and vision.
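As a rough illustration of that image layout, this Python sketch packs mono samples into an N-by-1 greyscale image. The details are assumptions, not what my script does: it takes signed 16-bit samples, maps them linearly to 8-bit grey (so silence is mid-grey), and writes binary PGM, which ImageMagick can read. The function name `samples_to_pgm` is hypothetical.

```python
def samples_to_pgm(samples):
    """Pack signed 16-bit mono samples into an N x 1 binary PGM (P5) image.

    Assumed mapping: -32768 -> 0 (black), 0 -> 128 (mid-grey), 32767 -> 255 (white).
    """
    pixels = bytes((s + 32768) >> 8 for s in samples)
    header = f"P5\n{len(samples)} 1\n255\n".encode("ascii")
    return header + pixels


# Three samples: full negative, silence, full positive.
image = samples_to_pgm([-32768, 0, 32767])
```

Writing `image` to a file with a `.pgm` extension gives a three-pixel-wide, one-pixel-high image; a real track would give one pixel per audio sample.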