Dumping unmodified raw image data from raw files?

rt985426 · May 4, 2021, 9:00pm

Again, the goal is not to ensure that the raw files are not modified.

Entropy512 · May 4, 2021, 9:00pm

Yeah. A large number of raw files with the same image data but modified internal EXIF (decent asset managers should NOT be mangling the file itself!!!) is a sign of a much more fundamental asset management issue, one that fingerprinting merely band-aids over instead of fixing the root cause.

Entropy512 · May 4, 2021, 9:02pm

But somehow hashing the results of rawpy’s imread() is also problematic?

rt985426 · May 4, 2021, 9:05pm

The hash generated by libraw’s output is dependent on the version of libraw, runtime parameters, etc. It’s not guaranteed to be unique.

That’s why I want to hash the actual unmodified raw data.

rt985426 · May 4, 2021, 9:09pm

No, I’m trying to come up with a good way to index large numbers of images programmatically by using the image data. Hashing the raw image data seems to me to be the most correct way to do this.

Any asset management will still obviously be able to function in a complementary way.
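To make the indexing idea concrete, here is a minimal sketch of grouping files by a hash of their image data. It assumes some way of getting the unmodified raw image bytes out of each file; `extract_raw_bytes` is a hypothetical callback standing in for that step, and SHA-256 is just one reasonable choice of hash, not something settled in this thread.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def index_by_fingerprint(
    paths: Iterable[Path],
    extract_raw_bytes: Callable[[Path], bytes],  # hypothetical extractor
) -> Dict[str, List[Path]]:
    """Group files by the SHA-256 of their extracted raw image data."""
    index: Dict[str, List[Path]] = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256(extract_raw_bytes(path)).hexdigest()
        index[digest].append(path)
    return dict(index)


def find_duplicates(index: Dict[str, List[Path]]) -> Dict[str, List[Path]]:
    """Keep only fingerprints shared by more than one file."""
    return {digest: files for digest, files in index.items() if len(files) > 1}
```

Files whose metadata differs but whose image data is identical would land under the same key, which is the property being asked for here.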

kmilos · May 4, 2021, 9:13pm

Blockchain for DAM

paperdigits · May 4, 2021, 9:57pm

I store all my raw files in git-annex, which hashes the file when you check it in and has an fsck command that lets you check the file against the last known hash…

I’d say that modifying a raw file’s metadata is not a great idea (that’s why not a lot of programs will actually do it), but that’s just me.

rt985426 · May 4, 2021, 11:47pm

This thread is getting a little off track regarding the purpose of calculating a hash of the raw image data.

The reason I want to do this is to determine uniqueness of the raw image data. That’s it.

I already know about sidecars and asset management schemes, but those things don’t help me with files that already exist in numerous places using different workflows/software/asset management, and are in different states of backup.

Knowing the uniqueness of the raw image data does.

< rant > Furthermore, there are cases where using sidecars is not possible, e.g. when using Adobe software with Adobe DNG raw files. But that’s a discussion for another day. < / rant >

I don’t like to touch the EXIF metadata either, but that’s only after I import the files into my workflow.

There are a few basic EXIF metadata changes that I do want to be permanent in the raw files themselves so that those changes aren’t dependent in the future on the presence of a sidecar file or the workflow or software/asset management scheme.

To give you an example, this is usually basic stuff like fixing timestamp/timezone information and adding GPS information for cameras that don’t support this. But sometimes I need to fix incomplete, incorrect, or missing lens data when using adapted lenses. There is also some level of sanitizing I like to perform as well before I even import the files into my workflow, e.g. removing star ratings or any sensitive information that was added in-camera. Then I can compare the hash of the raw image data of the original files with those of the modified files to check that everything is okay.

Aside from my personal use cases, there are plenty of other situations where a unique fingerprint for the raw image data would be useful.

ggbutcher · May 5, 2021, 1:13am

@rt985426,

Thanks for looking into what I interpreted from the libraw thread post; not sure yet what I’ll do with it.

To your use case, you’re looking to anchor the file’s uniqueness to the captured image itself. I gather from that that the important thing about what you pull from the libraw_internal scheme is that it never changes from pull to pull across the file’s life; you don’t care what format/compression it’s in, you just want the blob to hash the same every time. I get it; even in my picture organization scheme I have filename duplication, and I’m essentially relying on my directory structure to preserve unique identity. Not a really reliable scheme, but I’m not doing this professionally…

Even if one doesn’t modify their raw metadata, there are things the OS does to the filesystem attributes that can make the file, taken as a whole, differ from copy to copy…

priort · May 5, 2021, 1:25am

priort · May 5, 2021, 1:28am

rt985426 · May 5, 2021, 1:30am

This is not applicable to raw files.

It’s only for image types that php, python, and imagemagick can handle, e.g. jpg, gif, etc.

rt985426 · May 5, 2021, 1:31am

This won’t work for raw files, either.

priort · May 5, 2021, 1:49am

https://www.reddit.com/r/DataHoarder/comments/a7whd4/check_raw_dng_jpg_file_integrity_not_a_simple_crc/ I can keep trying

rt985426 · May 5, 2021, 1:52am

That won’t work for raw files, either.

priort · May 5, 2021, 1:55am

Why not??

rt966298 · May 5, 2021, 2:46am

[I’m rt985426. Since I am a new user, I already reached my maximum post count for the day. So I just created this new account so I don’t have to wait another 15 hours to post again…]

I already asked the exiftool author, and he confirmed to me that this method won’t work for raw files without a lot of work. It’s because IFD0 tags are specific to the raw file format and camera, and have to be deleted by name.

You can test it out on a few raw files yourself. It doesn’t work.

It’s also unclear to me whether, even if you could delete all of the IFD0 tags that don’t contain any raw image data, the resulting output would be identical across different versions of the software using this method. This is important because I want the hashes to be reproducible and not dependent on the software, version, or runtime parameters.

ggbutcher · May 5, 2021, 3:24am

I corrected the error of my ways; pull origin/master for the latest.

Doing the following:

$ ./reallyraw2dat DSG_3111.NEF DSG_3111.dat
$ ls -l DSG_3111.dat
-rw-r--r-- 1 glenn glenn 17885222 May  4 21:05 DSG_3111.dat

reports a file size the same as the StripByteCounts for SubIFD1:

$ exiftool -v DSG_3111.NEF
...
  | + [SubIFD1 directory with 17 entries]
  | | 0)  SubfileType = 0
  | | 1)  ImageWidth = 4992
  | | 2)  ImageHeight = 3280
  | | 3)  BitsPerSample = 14
  | | 4)  Compression = 34713
  | | 5)  PhotometricInterpretation = 32803
  | | 6)  StripOffsets = 1645472
  | | 7)  SamplesPerPixel = 1
  | | 8)  RowsPerStrip = 3280
  | | 9)  StripByteCounts = 17885222
  | | 10) XResolution = 300 (300/1)
  | | 11) YResolution = 300 (300/1)
  | | 12) PlanarConfiguration = 1
  | | 13) ResolutionUnit = 2
  | | 14) CFARepeatPatternDim = 2 2
  | | 15) CFAPattern2 = 0 1 1 2
  | | 16) SensingMethod = 2
...

SubIFD1 contains the raw image data in whatever format (compressed, etc.) the camera saved it in (SubfileType = 0: full-resolution image).
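For anyone who wants to hash that span directly, here is a minimal Python sketch, assuming the StripOffsets and StripByteCounts values have already been read from the file (for example from the exiftool listing above). The function name `hash_strip` and the choice of SHA-256 are my own assumptions, and it only covers the single-strip case shown here (RowsPerStrip equals ImageHeight); files stored as multiple strips or tiles would need each piece fed in order.

```python
import hashlib


def hash_strip(path, offset, byte_count, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of byte_count bytes starting at offset."""
    sha = hashlib.sha256()
    remaining = byte_count
    with open(path, "rb") as f:
        f.seek(offset)
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                # The file is shorter than offset + byte_count.
                raise ValueError("file ends before the strip does")
            sha.update(chunk)
            remaining -= len(chunk)
    return sha.hexdigest()


# Values taken from the listing above (StripOffsets, StripByteCounts):
# digest = hash_strip("DSG_3111.NEF", 1645472, 17885222)
```

Since editing metadata can move the strip within the file, the offset should be re-read from each file rather than reused from the original.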

rt966298 · May 5, 2021, 3:38am

I tried the latest version and followed your method.

I was able to verify matching StripByteCounts for my Sony ARW, Canon CR2, and Nikon NEF raw files, so this looks very promising!

The output for my Panasonic RW2 files is still empty, so I’m guessing it’s one of those more complex formats.

rt966298 · May 5, 2021, 3:42am

I also tried modifying the EXIF data for those Sony ARW, Canon CR2, and Nikon NEF raw files, and then compared the hashes of the raw image dumps for the original and modified files, and they are identical.

So it seems to be working for these raw file types.
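The comparison described above can be scripted. This is a minimal sketch, assuming each original and modified raw file has already been dumped to its own .dat file; the helper names and the use of SHA-256 are assumptions rather than anything prescribed in this thread.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Hash a whole dump file in chunks so large dumps are not loaded at once."""
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def find_mismatches(pairs):
    """Return the (original_dump, modified_dump) pairs whose hashes differ."""
    return [
        (original, modified)
        for original, modified in pairs
        if file_sha256(original) != file_sha256(modified)
    ]


# Hypothetical file names, for illustration only:
# bad = find_mismatches([("DSG_3111.dat", "DSG_3111_edited.dat")])
```

An empty result would mean the metadata edits left the image data untouched for every pair checked.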