[Feature request] Copy verification option

Thanks so very much for this great s/w. I especially love the file renaming flexibility. I have one request…
I would really like the ability to verify that files are copied correctly. If it is already inherent in RPD, I couldn’t find any reference to it in the documentation.

I really want to know that files have been copied 100% correctly before I delete the originals off the SD card. Would it be possible to add an option for RPD to generate md5/sha256 or somesuch for each file as part of the copy process, and to verify the copy against the original?
I’m thinking of something like creating a file.original.md5 in the destination folder with the checksum of the source file. After the copy, md5sum --check (for example) would be used to verify integrity. Perhaps using the tee command would allow the source file to only be accessed once for both the copy and the checksum generation so may not slow things down too much.
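The tee idea above can be sketched in shell. This is a minimal sketch, assuming GNU coreutils; the file and directory names are made-up examples, and here the “source” file is generated just so the snippet is self-contained:

```shell
# Demo setup: in real use these would be the card and destination paths.
src=IMG_0001.CR2            # hypothetical source file from the card
dst=copied/IMG_0001.CR2     # hypothetical renamed destination
mkdir -p copied
head -c 100K /dev/urandom > "$src"   # stand-in for a real raw file

# One read of the source: tee writes the copy while md5sum hashes the stream.
# awk rewrites md5sum's stdin marker ("-") into the destination filename so
# that md5sum --check can consume the manifest later.
tee "$dst" < "$src" | md5sum \
  | awk -v f="$(basename "$dst")" '{print $1 "  " f}' > "$dst.original.md5"

# Verify the copy against the recorded checksum.
( cd "$(dirname "$dst")" && md5sum --check "$(basename "$dst").original.md5" )
```

Note this hashes the stream as it is read, so (as discussed further down the thread) it cannot detect a read error that already corrupted that single pass over the source.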
I appreciate it may slow things down and so may have to be a select-able option. Personally I’d be happy to accept a slowdown (or a second verification step) for the peace of mind it would give.

FYI, at the moment I use kdiff3 (or sometimes meld) to do a binary comparison after the copy (and on the backup), before deleting the originals from the SD card.
Thanks again.

Just for reference, Filmulator silently retries up to 5x if the imported copy doesn’t hash the same as the original.

…I haven’t observed any issues, but I haven’t set up any error logging or anything so I wouldn’t really know either.

Although this would be better than nothing, notice that it would miss any problem in the reading during copying, or with the way RPD accesses the file. The best approach would be to checksum the file before copying, using an external tool.

Let’s think this through carefully.

Like any modern OS, Linux caches files in RAM when they are read and written. Read a file once from the file system, which is slow, and any subsequent read comes from RAM, which is fast. The same goes for writes.

So imagine reading a file just to verify its checksum, before copying it. It’s now in RAM. You then copy it, which takes the bits from RAM, not the file system. Now suppose there was a read error when you first accessed the file to compute its checksum: the checksum, the cached copy, and the verification would all reflect the same corrupted read, so you’d never catch the error. Which OS-level process performed the initial read is totally irrelevant.

The same goes for the file copy. You write to disk. The write is cached in RAM. You then read it to verify it, which again is from RAM. You don’t truly know if it was written correctly.

Now, I suppose it might be possible to instruct the Linux kernel to read straight from the file system, bypassing the RAM cache. Does anyone know how to do that without sudo access? It’s easy to sync and drop all caches in Linux, but it requires sudo and it affects every running process that touches the file system.
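For reference, the sudo-requiring cache drop mentioned above looks like the following; it flushes dirty pages to disk and then discards the page cache (plus dentries and inodes) system-wide, which is exactly why it is a poor fit for a desktop application:

```shell
# Flush dirty pages to disk first.
sync

# Drop all caches (system-wide!). Writing to drop_caches requires root;
# sudo -n fails non-interactively, and the fallback message keeps this
# sketch harmless when run without privileges.
echo 3 | sudo -n tee /proc/sys/vm/drop_caches > /dev/null 2>&1 \
  || echo "root privileges required to drop caches"
```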

Really, the only reliable way to verify the file is to read a checksum that the camera itself produced when it first created the file.

Question: which file formats embed a checksum in their metadata at the time the camera created the file?

Another thought: given the ever-present (and, for photographers, very serious) reality of bitrot, the initial file copy from the camera may not even be all that important. If you really want to preserve your files, you need a file system that guards against bitrot.

Personally I’ve never encountered a read / write error copying files from the memory card. But I have had previously working CR2 files go bad years later due to what I can only assume is bitrot.

Tools like git-annex can help with bitrot too, it isn’t limited only to the filesystem :slight_smile:


Thanks for the explanation, I hadn’t thought about the cache. Given that, the next best thing would then be to:

  • checksum the files on the card
  • unmount the card
  • mount the card again
  • copy the file
  • unmount the drive (or reset the computer if not possible)
  • checksum the file

Certainly not practical…
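Spelled out as a script, the sequence above would look something like this sketch. The device, mount point and destination are hypothetical, and udisksctl is used because it can typically (un)mount removable media without root:

```shell
# Sketch only -- as noted, not practical. Device and paths are hypothetical,
# so the function is defined here but deliberately never invoked.
verify_roundtrip() {
    card_dev=/dev/sdb1      # hypothetical card-reader device
    card_mnt=/mnt/card      # hypothetical mount point
    dest=/photos/incoming   # hypothetical destination

    # 1. Checksum the files on the card (manifest keyed by basename).
    ( cd "$card_mnt"/DCIM && md5sum *.CR2 > /tmp/card.md5 )

    # 2-3. Unmount and remount the card to invalidate its cached pages.
    udisksctl unmount -b "$card_dev"
    udisksctl mount -b "$card_dev"

    # 4. Copy the files.
    cp "$card_mnt"/DCIM/*.CR2 "$dest"/

    # 5. Unmount/remount the destination drive too (or restart, if it
    #    cannot be unmounted), so the verification read is not from cache.

    # 6. Checksum the copies and compare against the card manifest.
    ( cd "$dest" && md5sum --check /tmp/card.md5 )
}
```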

What I started to do is to checksum all my RAW files on import, saving the checksum in an XMP sidecar file (which has the advantage of then being propagated to the JPG exports, so it doubles as the image ID). I haven’t connected this to automatically verifying the files yet, but the backups are checked periodically for bitrot errors.
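A minimal sketch of that sidecar approach follows. The choice of XMP field is an assumption here (xmpMM:OriginalDocumentID is used as a plausible home for the checksum), and the filename is a made-up example generated so the snippet is self-contained:

```shell
# Demo setup: stand-in for a real raw file.
f=IMG_0002.CR2
head -c 100K /dev/urandom > "$f"

sum=$(sha256sum "$f" | cut -d' ' -f1)

# Write a bare-bones XMP sidecar carrying the checksum. Where exactly to
# store it within XMP is an assumption, not a standard practice.
cat > "${f%.*}.xmp" <<EOF
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
                   xmpMM:OriginalDocumentID="sha256:${sum}"/>
 </rdf:RDF>
</x:xmpmeta>
EOF

# Periodic bitrot check: recompute and compare against the sidecar.
grep -q "sha256:$(sha256sum "$f" | cut -d' ' -f1)" "${f%.*}.xmp" \
  && echo "checksum still matches"
```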

I think one thing to consider is that the (consumer-grade) SD card is probably a much more likely source of errors than your disk, but I can’t find any data on SD card reliability.

The O_DIRECT flag to the open(2) system call might do the trick.

From man 2 open:

  O_DIRECT (since Linux 2.4.10)
         Try  to minimize cache effects of the I/O to and from this file.

But it looks like it comes with a few caveats as well.
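From the command line, GNU dd exposes this flag via iflag=direct, so a cache-bypassing checksum can be sketched without writing any C. This is a sketch with a generated sample file; note that O_DIRECT requires aligned I/O, and some filesystems (e.g. tmpfs) reject the flag outright, which the fallback branch accounts for:

```shell
# Stand-in file for the demonstration.
f=sample.bin
head -c 100K /dev/urandom > "$f"

# Read via O_DIRECT, bypassing the page cache, and hash the stream.
# bs must satisfy the filesystem's alignment requirements; 4096 is typical.
sum_direct=$(dd if="$f" iflag=direct bs=4096 status=none 2>/dev/null \
             | sha256sum | cut -d' ' -f1)
sum_cached=$(sha256sum "$f" | cut -d' ' -f1)

if [ "$sum_direct" = "$sum_cached" ]; then
    echo "direct and cached reads agree"
else
    echo "mismatch, or O_DIRECT unsupported on this filesystem"
fi
```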

Now even if this works and bypasses the page cache, the disk’s controller could still answer you from its DRAM cache. You could try to convince the disk not to do that using something like hdparm -W, but you quickly start to fall down a rather deep rabbit hole.

If you care that much about data integrity, you are likely better off building a workstation with ECC RAM and multiple SLC disks, and using a file system that does checksumming, like ZFS or Btrfs.

[quote=“damonlynch, post:4, topic:21338”]
Personally I’ve never encountered a read / write error copying files from the memory card. But I have had previously working CR2 files go bad years later due to what I can only assume is bitrot.
[/quote]

Just out of curiosity: did this happen on an old consumer HDD or on a modern SSD? I’ve personally never experienced (silent) bitrot on the latter.

So your camera has git-annex support and commits directly on write? If not, then within the scope of this discussion, that’s not a helpful comment.

Looks like it! This statement from Linus is from thirteen years ago, but he calls it “totally braindamaged” and says “There really is no valid reason for EVER using O_DIRECT”. I have no idea what his more recent thoughts are. In any case, thanks for pointing that out to me. I guessed it had to be there in the Linux API somewhere.

Consumer HDDs, either 2.5" laptop drives or in more recent years 3.5" desktop drives. I actually don’t know when it happened though. I just noticed it a few years ago when going through some old CR2 files.

I think you’re right, but to be honest I’m kind of surprised it’s 2020 and camera manufacturers aren’t making built-in checksums at the RAW level a selling point of their cameras. Presumably it would be easy enough to implement, especially if the data being checksummed does not include the checksum itself! But I suppose clever mathematicians can figure out ways around that problem too.

@damonlynch I wouldn’t be surprised if ‘totally braindamaged’ is Linus’ way of saying ‘suboptimal’. :wink:

It’s indeed a bit odd that these file formats tend to lack means of checking file integrity.
As far as I know, only PNG includes a CRC of its data blocks.

photorec cross-checks JPEGs against the internal thumbnail to get some hints.

Maybe the cameras do actually write some checksums into their ‘database’ files on the card, but it’s going to be quite some work to reverse engineer those.

As things stand, if I want to verify the copy, then I can’t use one of RPD’s most useful features – the file renaming. As a workaround, would it be possible to output a file/log with the original and copied dir/filename that could be used by an external script – or directly by md5sum/shaXXXsum?
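Such a log could be as simple as a checksum manifest keyed by the renamed destination names, which md5sum can consume directly. A sketch follows, with made-up names and a hypothetical tab-separated mapping file standing in for whatever RPD might emit:

```shell
# Demo setup: pretend these were produced by the downloader.
mkdir -p card dest
head -c 100K /dev/urandom > card/IMG_0001.CR2
cp card/IMG_0001.CR2 dest/20200115-0001.cr2     # the renamed copy
printf 'card/IMG_0001.CR2\tdest/20200115-0001.cr2\n' > renames.tsv

# Build a manifest: hash each *source* file, list it under its
# destination name so md5sum checks the copies, not the originals.
: > import.md5
while IFS="$(printf '\t')" read -r src dst; do
    printf '%s  %s\n' "$(md5sum "$src" | cut -d' ' -f1)" "$dst" >> import.md5
done < renames.tsv

# Verify the renamed copies directly with md5sum.
md5sum --check import.md5
```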

I can understand not wanting to run as root. One solution would be to simply prompt the user to eject/remount source and destination drives (if removable), or re-logon if not. RPD could check for a pending verification at startup.

I think verification is pretty important – I would be really happy to make a donation to help with that.

That’s kind of you. But please understand that virtually no one will use a file verification feature if it involves rebooting — logging out will not be sufficient, as a file system cache is a kernel level feature, independent of the desktop environment.

For file verification to be done properly, it requires the following:

  1. Researching which raw file formats include a built-in checksum. As far as I know, only the DNG format does, but my information might be out of date.
  2. Researching which camera manufacturers, among those whose RAW formats support checksums, actually implement the checksum feature instead of ignoring that part of the spec.
  3. Researching how to reliably instruct the Linux kernel to read a file from its source, bypassing the cache, without sudo permissions. Maybe O_DIRECT is sufficient, maybe not.
  4. Researching a recent initiative undertaken by, I think, Adobe (and possibly others) that is intended to one day verify files from creation right through the workflow — not so much to catch copying errors, but because of the requirements of legal verification, i.e. digital forensics. Although the goal is different, there might be something useful there regarding copy verification. If camera manufacturers get on board with this initiative, that might help those of us who would like copy verification.

Are you willing to fund someone to do this basic background research, or to spearhead a fundraising effort that would enable this?

I didn’t realize that. That would be a problem.

I’ve never done any funding campaigns, but I’d be willing to give it a try if it would help.
To get around the cache flush, how about filling RAM with nulls or random data? If the RAM gets filled before the verify step, wouldn’t that effectively force the data to be re-read?
As for ejecting and remounting (without rebooting), I think you’re saying that would work for the removable SD cards?
For the local storage, would that not also work for users that have the storage on removable mount points?

Me neither. I guess the first and probably most important step would be identifying which platform to use.

Taking such an approach would likely get the program permanently ejected from being distributed by any self-respecting Linux distro :wink:

Yes, because the OS has no way of knowing what happened to the SD card in the intervening period.

Unmounting file systems is problematic and I’d never suggest it, including for removable mount points, for several reasons.

The best solution is:

  1. to do it through code, understanding how the Linux kernel expects it to be done (there must be a way), and
  2. understanding how the broader industry handles it, e.g. the Adobe initiative, if it’s still a thing.