Verifying backups in Filmulator

CarVac · April 3, 2019, 3:16am

Filmulator keeps track of files by their md5 hashes.

It also lets you simultaneously import to a main working drive (for faster access) and an online backup (for bulk storage).

When you’re done with the stuff on the main drive, you can rsync the output jpegs to your online backup, and all the jpegs and raws together to an offline backup, and then point Filmulator at the online backup for any future editing.

So this past week I was doing this, rsyncing my 2018 files from my SSD to internal and external hard drives. However, upon doing the ‘Import in place/Update file locations’ in Filmulator, I found a few broken files for the first time ever.

The ones on my external hard drive, which had just been copied from my SSD, were fine.

That meant that the copy Filmulator wrote to the backup during import was the one that had failed.

In the code here, the first thing Filmulator does is open the source file and compute its hash.

Next, it copies the file to the backup, and finally it copies it again to the main working directory.
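To make that order of operations concrete, here is a rough sketch in Python. It is not Filmulator's actual code: the function names `md5_of_file` and `import_file` and the directory arguments are made up for illustration, and it assumes plain file copies with no verification, as described above.

```python
import hashlib
import shutil
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def import_file(source, backup_dir, working_dir):
    """Hash the source, copy it to the backup, then copy it to the working dir.

    Hypothetical helper: nothing here checks that the copies match the hash.
    """
    source = Path(source)
    source_hash = md5_of_file(source)
    backup_path = Path(backup_dir) / source.name
    working_path = Path(working_dir) / source.name
    shutil.copy2(source, backup_path)   # backup copy is written first
    shutil.copy2(source, working_path)  # working copy is written second
    return source_hash, backup_path, working_path
```

Note that the hash is only ever computed from the source, so a bad write to either destination goes unnoticed in this version.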

Why might the backup copy have failed when the main copy worked? I don’t know. Does it have to do with filesystem caching? I’m not at all sure.

In any case, in a commit on a new branch I have Filmulator verify the hashes after writing to disk.

If I were really paranoid, I would have Filmulator compute the source file’s hash three times over and ensure that they’re all the same before proceeding in case of card reader wonkiness… what do you think? Would that even work, what with file caching?
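For the sake of discussion, the triple-hash idea might look something like this in Python. This is only a sketch: `stable_source_hash` and its `reads` parameter are hypothetical names, not anything in Filmulator. The caching worry is real, too: the second and third reads may be served from the operating system's page cache rather than from the card, in which case matching hashes would not prove that the reader delivered the same bytes each time.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stable_source_hash(path, reads=3):
    """Hash the file `reads` times and return the hash only if all agree.

    Caveat: repeated reads may hit the OS page cache instead of the device,
    so agreement is weaker evidence than it looks.
    """
    hashes = {md5_of_file(path) for _ in range(reads)}
    if len(hashes) != 1:
        raise OSError(f"inconsistent reads of {path}: {sorted(hashes)}")
    return hashes.pop()
```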

Let me know what you think.

afre · April 3, 2019, 4:21am

I read a lot about hashing too many years ago; I have no recollection of any of it. Maybe explore other types of hashes, and how and when the hashes are calculated or checked. Weird things might be happening on SSDs. I think they do their own hashing as well… And have their own set of copying techniques that may interfere with the copying of certain files… Don’t know; just thinking random thoughts out loud.

CarVac · April 3, 2019, 4:23am

Well, before, I wasn’t verifying any of the hashes at all. It would copy and then proceed.

Now I have it verify the hashes and retry if it fails.
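Roughly, the verify-and-retry idea looks like the following Python sketch. It is an illustration, not the actual commit: `copy_verified`, `max_attempts`, and the choice to raise after the last failed attempt are all assumptions.

```python
import hashlib
import shutil


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(source, dest, expected_hash, max_attempts=3):
    """Copy source to dest, re-hash dest, and retry on a mismatch.

    Returns the number of attempts used; raises OSError if every
    attempt produced a file whose hash differs from expected_hash.
    """
    for attempt in range(1, max_attempts + 1):
        shutil.copy2(source, dest)
        if md5_of_file(dest) == expected_hash:
            return attempt
    raise OSError(f"{dest} failed verification after {max_attempts} attempts")
```

One limitation of this sketch: the read-back right after the copy may come from the page cache rather than from the disk, so it mainly catches errors that happen before the data reaches the cache.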

mbs · April 3, 2019, 11:26pm

It might also be a good idea to call fsync() after writing – it might help data make it to the physical media instead of staying in a cache. See man fsync.
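As a rough illustration of where the call would go, here is a Python sketch using `os.fsync`, which wraps the same system call. The helper name `copy_and_sync` is made up, and Filmulator's own copy code may be structured quite differently.

```python
import os


def copy_and_sync(source, dest, chunk_size=1 << 20):
    """Copy source to dest, then ask the OS to push dest's data to the device."""
    with open(source, "rb") as src, open(dest, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
        dst.flush()             # empty Python's userspace buffer first
        os.fsync(dst.fileno())  # then request a flush of the OS cache to disk
```

The `flush()` before `os.fsync()` matters because fsync only sees data that has already left the process's own buffer.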

darix · April 3, 2019, 11:48pm

In 2019… really?

CarVac · April 3, 2019, 11:49pm

In 2014, actually. I was a noob back then.

darix · April 4, 2019, 12:00am

CarVac · April 4, 2019, 12:06am

Yes, I know it’s broken as a cryptographic hash.

It’s used for identification/deduplication and checking against bitrot, not as a security measure.

Which means it’s slower than necessary, but… it’s far from being a bottleneck.
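To illustrate the bitrot-checking use, here is a small Python sketch. It assumes a hypothetical catalog that maps each file path to the MD5 recorded at import; `find_damaged_files` is an invented name and this is not how Filmulator's database is actually laid out.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_damaged_files(catalog):
    """Return the paths whose current hash differs from the recorded one.

    `catalog` is a dict of {path: md5 recorded at import}. Files that
    cannot be read at all are reported as damaged too.
    """
    damaged = []
    for path, expected in catalog.items():
        try:
            actual = md5_of_file(path)
        except OSError:
            damaged.append(path)
            continue
        if actual != expected:
            damaged.append(path)
    return damaged
```

For this purpose an accidental collision is the only concern, which is why a hash that is broken against deliberate attacks can still do the job.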