Dumping unmodified raw image data from raw files?

Again, the goal is not to ensure that the raw files are not modified.

Yeah. A large number of raw files with the same image data but modified internal EXIF (decent asset managers should NOT be mangling the file itself!!!) is a sign of a much more fundamental asset management issue, and fingerprinting is a band-aid around it instead of a fix for the root cause.

But somehow hashing the results of rawpy’s imread() is also problematic?

The hash generated by libraw’s output is dependent on the version of libraw, runtime parameters, etc. It’s not guaranteed to be unique.

That’s why I want to hash the actual unmodified raw data.

No, I’m trying to come up with a good way to index large numbers of images programmatically by using the image data. Hashing the raw image data seems to me to be the most correct way to do this.

Any asset management will still obviously be able to function in a complementary way.

Blockchain for DAM :wink:

I store all my raw files in git-annex, which hashes the file when you check it in and has an fsck command that lets you check the file against the last known hash…

I’d say that modifying a raw file’s metadata is not a great idea (that’s why not a lot of programs will actually do it), but that’s just me.


This thread is getting a little off track regarding the purpose of calculating a hash of the raw image data.

The reason I want to do this is to determine uniqueness of the raw image data. That’s it.

I already know about sidecars and asset management schemes, but those things don’t help me with files that already exist in numerous places using different workflows/software/asset management, and are in different states of backup.

Knowing the uniqueness of the raw image data does.

< rant > Furthermore, there are cases where using sidecars is not possible, e.g. when using Adobe software with Adobe DNG raw files. But that’s a discussion for another day. < / rant >

I don’t like to touch the EXIF metadata either, but that’s only after I import the files into my workflow.

There are a few basic EXIF metadata changes that I do want to be permanent in the raw files themselves so that those changes aren’t dependent in the future on the presence of a sidecar file or the workflow or software/asset management scheme.

To give you an example, this is usually basic stuff like fixing timestamp/timezone information and adding GPS information for cameras that don’t support this. But sometimes I need to fix incomplete, incorrect, or missing lens data when using adapted lenses. There is also some level of sanitizing I like to perform as well before I even import the files into my workflow, e.g. removing star ratings or any sensitive information that was added in-camera. Then I can compare the hash of the raw image data of the original files with those of the modified files to check that everything is okay.

Aside from my personal use cases, there are plenty of other reasons where a unique fingerprint for the raw image data would be useful.

@rt985426,

Thanks for looking into what I interpreted from the libraw thread post; not sure yet what I’ll do with it.

To your use case, you’re looking to anchor the file’s uniqueness to the captured image itself. I’d gather from that, if the important thing about what you pull from the libraw_internal scheme is that it never changes from pull to pull across the file’s life, you don’t care what format/compression it’s in, you just want the blob to hash the same every time. I get it; even in my picture organization scheme I have filename duplication, and I’m essentially relying on my directory structure to preserve unique identity. Not a really reliable scheme, but I’m not doing this professionally… :laughing:

Even if one doesn’t modify their raw metadata, there are things the OS does to the filesystem attributes that can render the file in total different from copy to copy…

This is not applicable to raw files.

It’s only for image types that php, python, and imagemagick can handle, e.g. jpg, gif, etc.

This won’t work for raw files, either.

https://www.reddit.com/r/DataHoarder/comments/a7whd4/check_raw_dng_jpg_file_integrity_not_a_simple_crc/ I can keep trying

That won’t work for raw files, either.

Why not??

[I’m rt985426. Since I am a new user, I already reached my maximum post count for the day. So I just created this new account so I don’t have to wait another 15 hours to post again…]

I already asked the exiftool author, and he confirmed to me that this method won’t work for raw files without a lot of work. It’s because IFD0 tags are specific to the raw file format and camera, and have to be deleted by name.

You can test it out on a few raw files yourself. It doesn’t work.

It’s also unclear to me, even if you could delete all of the IFD0 tags that don’t contain any raw image data, whether the resulting output would be identical across different versions of the software using this method. This is important because I want the hashes to be reproducible and not dependent on the software, version, or runtime parameters.

I corrected the error of my ways; pull origin/master for the latest.

Doing the following:

$ ./reallyraw2dat DSG_3111.NEF DSG_3111.dat
$ ls -l DSG_3111.dat
-rw-r--r-- 1 glenn glenn 17885222 May  4 21:05 DSG_3111.dat

reports a file size the same as the StripByteCounts for SubIFD1:

$ exiftool -v DSG_3111.NEF
...
 | + [SubIFD1 directory with 17 entries]
  | | 0)  SubfileType = 0
  | | 1)  ImageWidth = 4992
  | | 2)  ImageHeight = 3280
  | | 3)  BitsPerSample = 14
  | | 4)  Compression = 34713
  | | 5)  PhotometricInterpretation = 32803
  | | 6)  StripOffsets = 1645472
  | | 7)  SamplesPerPixel = 1
  | | 8)  RowsPerStrip = 3280
  | | 9)  StripByteCounts = 17885222
  | | 10) XResolution = 300 (300/1)
  | | 11) YResolution = 300 (300/1)
  | | 12) PlanarConfiguration = 1
  | | 13) ResolutionUnit = 2
  | | 14) CFARepeatPatternDim = 2 2
  | | 15) CFAPattern2 = 0 1 1 2
  | | 16) SensingMethod = 2
...

SubIFD1 contains the raw image data in whatever format (compressed, etc.) saved by the camera (SubfileType = 0: Full-resolution image).
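For illustration, once StripOffsets and StripByteCounts are known, hashing just that byte range is straightforward. Below is a minimal Python sketch, assuming a single strip (as in the dump above) and SHA-256 as the hash. The function name hash_byte_range is made up for this example and is not part of reallyraw2dat or exiftool.

```python
import hashlib


def hash_byte_range(path, offset, length, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of `length` bytes starting at `offset`.

    Hypothetical helper: offset and length are meant to be the StripOffsets
    and StripByteCounts values reported for the full-resolution SubIFD.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        fh.seek(offset)
        remaining = length
        while remaining > 0:
            # Read in chunks so a large strip is never held in memory at once.
            chunk = fh.read(min(chunk_size, remaining))
            if not chunk:
                raise ValueError("file ends before the requested range does")
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


# Example call with the values from the exiftool dump above:
# hash_byte_range("DSG_3111.NEF", 1645472, 17885222)
```

Because only the strip bytes are fed to the hash, edits to tags stored elsewhere in the file should not change the digest, as long as the strip itself is written back byte for byte.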

[I’m rt985426. Since I am a new user, I already reached my maximum post count for the day. So I just created this new account so I don’t have to wait another 15 hours to post again…]

I tried the latest version and followed your method.

I was able to verify matching StripByteCounts for my Sony ARW, Canon CR2, and Nikon NEF raw files, so this looks very promising!

The output for my Panasonic RW2 files is still empty, so I’m guessing it’s one of those more complex formats.

I also tried modifying the EXIF data for those Sony ARW, Canon CR2, and Nikon NEF raw files, and then compared the hashes of the raw image dumps for the original and modified files, and they are identical.

So it seems to be working for these raw file types.
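The check described above (same raw payload hash before and after a metadata edit) could be scripted roughly as follows. This is a sketch, not how reallyraw2dat works: it assumes a TIFF-structured file whose full-resolution image sits in one of the SubIFDs of IFD0 (or in IFD0 itself) with SubfileType = 0 and is stored in strips, as in the NEF dump earlier. Other layouts (tiles, chained IFDs, non-TIFF headers) are not handled, SHA-256 is an arbitrary choice, and the names read_ifd and raw_strip_digest are invented for this example.

```python
import hashlib
import struct

# Tag numbers from the TIFF 6.0 specification and its SubIFD extension.
NEW_SUBFILE_TYPE = 254
STRIP_OFFSETS = 273
STRIP_BYTE_COUNTS = 279
SUB_IFDS = 330


def read_ifd(data, offset, endian):
    """Return {tag: tuple of ints} for the SHORT/LONG/IFD entries of one IFD."""
    (count,) = struct.unpack_from(endian + "H", data, offset)
    tags = {}
    for i in range(count):
        entry = offset + 2 + 12 * i
        tag, typ, n = struct.unpack_from(endian + "HHI", data, entry)
        if typ not in (3, 4, 13):  # SHORT, LONG, IFD; other types are skipped
            continue
        code = "H" if typ == 3 else "I"
        pos = entry + 8  # values that fit in 4 bytes are stored inline
        if struct.calcsize(code) * n > 4:
            (pos,) = struct.unpack_from(endian + "I", data, entry + 8)
        tags[tag] = struct.unpack_from(f"{endian}{n}{code}", data, pos)
    return tags


def raw_strip_digest(path):
    """SHA-256 of the strip bytes of the first IFD with SubfileType = 0."""
    with open(path, "rb") as fh:
        data = fh.read()
    endian = {b"II": "<", b"MM": ">"}.get(data[:2])
    if endian is None:
        raise ValueError("not a TIFF-style header")
    magic, ifd0_offset = struct.unpack_from(endian + "HI", data, 2)
    if magic != 42:
        raise ValueError("unexpected TIFF magic number")
    ifd0 = read_ifd(data, ifd0_offset, endian)
    candidates = [ifd0]
    candidates += [read_ifd(data, off, endian) for off in ifd0.get(SUB_IFDS, ())]
    for ifd in candidates:
        # An IFD without the SubfileType tag is treated as "not full resolution".
        if ifd.get(NEW_SUBFILE_TYPE, (1,))[0] != 0 or STRIP_OFFSETS not in ifd:
            continue
        digest = hashlib.sha256()
        for off, size in zip(ifd[STRIP_OFFSETS], ifd[STRIP_BYTE_COUNTS]):
            digest.update(data[off:off + size])
        return digest.hexdigest()
    raise ValueError("no full-resolution strip data found")
```

Comparing raw_strip_digest("original.NEF") with raw_strip_digest("modified.NEF") would then be the test; both file names are placeholders.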