Dumping unmodified raw image data from raw files?

Upon examination of the output from your rawdata program, it appears as if the original raw image data has been converted to PGM format.

The file contents are essentially identical to the output of libraw’s bin/unprocessed_raw, but without the PGM header and with one additional trailing character.

Because the raw data has been modified, this isn’t suitable for what I’m trying to do: the output depends on how libraw generates this data.

I found this comment and code snippet on libraw’s website which uses libraw’s internal data unpacker, and I think this is closer to what I’m trying to do:

https://www.libraw.org/comment/1946#comment-1946

This was an interesting discussion…
https://exiftool.org/forum/index.php?topic=6659.0

If you’re concerned about compression being modified, I get it.

libraw_internal_data is a protected member of the LibRaw class; you’ll need to get the LibRaw source, move it to public:, compile LibRaw, and link to it instead of the libraw version installed by the OS.

Ugh, that sounds kludgy, but it may be the only way to do what I want unless you can think of a better way.

I wonder if the libraw maintainers would consider adding a public analog for libraw_internal_data.

Did a little more digging, and there’s a way to get a pointer to the data structure with this function:

libraw_internal_data_t *get_internal_data_pointer()

Line 303 in libraw/libraw.h. So:

libraw_internal_data_t * libraw_internal = RawProcessor.get_internal_data_pointer();

Rob Sumner’s guide is indeed a fantastic tutorial on how to access raw pixel values and understand/toy with basic processing steps for those interested, and it is still applicable to Matlab/Octave AFAIK.

Btw, as dcraw seems to no longer be maintained, people might have better luck w/ dcraw_emu or unprocessed_raw tools from the LibRaw package.

@ggbutcher

import rawpy
with rawpy.imread('image.nef') as raw:
  raw_image = raw.raw_image.copy()

This, however, doesn’t help the OP, as the data is after unpacking/decompression, and I don’t think rawpy exposes that internal data structure as discussed…

So, I re-organized the rawdata repo to include what we’ve discussed here. I renamed rawdata.cpp to raw2tiff.cpp, and made two new programs, raw2dat.cpp and reallyraw2dat.cpp.

I also replaced the Makefile with a CMakeLists.txt. Funny, making the program changes took about 15 minutes. Writing that danged CMakeLists.txt took the better part of a day…

https://github.com/butcherg/rawdata

Good stuff. On that linked page:

reallyraw2dat: Uses an undocumented access method to the Libraw
internals to retrieve the unmodified (uncompressed, etc...) image data
from the camera raw file and write it to a .dat file.

I think you mean that your program doesn’t modify the data, and doesn’t de-compress it. The word “uncompressed” suggests that your program uncompresses it.

Fixed: ‘uncompressed’ to ‘compressed’. I think that conveys it properly; let me know if it reads another form of ‘funny’… :laughing:

Also, I need to add a disclaimer for reallyraw2dat.cpp: I have no real idea what is retrieved from that libraw location, except what’s asserted in the libraw thread post. YMMV…

Although if someone is being that strict - why not just hash the entire file?

When I run your reallyraw2dat program, I’m getting an empty output file.

I was having similar issues when I was trying to figure this out earlier. Even if I debug by using sufficiently large numbers for size and count for fwrite, the output file is still always empty.

fwrite (rawimage, 1000, 1000, f);

Also, libraw_internal_data->unpacker_data.data_size is giving inconsistent sizes for nearly identical Canon CR2 raw files (though I’m not sure if there’s anything wrong with that), identical sizes for Sony ARW raw files, and sometimes zero for others.

Hmmm… I ran it against one of my NEFs and got a decent-sized file, couldn’t find a way to verify the size with either exiftool or exiv2.

If the inconsistent sizes are close, within ~50 pixels or so, that could be the difference between the uncropped raw image and the image cropped of the sensor’s masked borders. I was contemplating outputting to stdout various essential libraw-supplied metadata like the width, height, rawwidth, rawheight, but became subsumed with figuring out CMake… :crazy_face:

I won’t be able to do anything else on this for about a week, but I’ll be thinking about it…

I’m not sure how to verify that the size (or raw data) is correct, either.

Maybe test by modifying the EXIF data for some raw files and make sure the reallyraw2dat output is still the same. At the very least, we would know that it isn’t necessarily doing the wrong thing. :man_shrugging:

I figured out what’s wrong: the data_offset is supposed to be for the input file, but your program is using it as a data pointer in memory.
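To make the distinction concrete, here is a minimal sketch of treating the offset as a position in the input file rather than as a memory address. It assumes the two numbers were already obtained elsewhere (for example from libraw’s unpacker_data.data_offset and data_size, as discussed above); the function names read_payload and copy_payload are hypothetical, and this is not how reallyraw2dat itself is written.

```python
def read_payload(path, data_offset, data_size):
    """Return data_size bytes starting at byte data_offset of the file."""
    with open(path, "rb") as f:
        f.seek(data_offset)          # the offset is a position in the file
        payload = f.read(data_size)
    if len(payload) != data_size:
        # A short read means the offset/size pair does not fit this file.
        raise ValueError("offset/size run past the end of the file")
    return payload


def copy_payload(raw_path, dat_path, data_offset, data_size):
    """Write the selected byte range of raw_path to dat_path, unmodified."""
    with open(dat_path, "wb") as out:
        out.write(read_payload(raw_path, data_offset, data_size))
```

The short-read check is there because a data_size of zero or an offset beyond the end of the file would otherwise silently produce an empty or truncated output.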

btw, I asked the libraw maintainers about the apparent inconsistencies in data_size, and they said that is to be expected because many raw formats use compression.

Furthermore, they also noted that this low-level method won’t work for raw formats that use more complex data structures involving chunked data like for tiling or striping, and recommended using libraw’s unpack functionality which can already do this.
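For formats that store the image in several chunks, a single offset/size pair is not enough, but the same idea extends to a list of byte ranges. The sketch below is only an illustration under that assumption: hash_spans is a hypothetical helper, and the list of (offset, length) pairs is assumed to come from somewhere else, such as a TIFF-style strip or tile offset table. It does not attempt to parse any real raw format.

```python
import hashlib


def hash_spans(path, spans, chunk_size=1 << 20):
    """SHA-256 over the given (offset, length) byte ranges, in list order."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for offset, length in spans:
            f.seek(offset)
            remaining = length
            while remaining > 0:
                # Read in bounded blocks so large files are not loaded whole.
                block = f.read(min(chunk_size, remaining))
                if not block:
                    raise ValueError("span runs past the end of the file")
                digest.update(block)
                remaining -= len(block)
    return digest.hexdigest()
```

The result depends on the order of the spans, so the caller would need to pass them in a stable order (for instance, the order in which the file’s own table lists them).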

I don’t have a good understanding of the various data structures, so I’ll need to dig into this further.

Which again gets to the question:
What is your exact goal/use case here, and why do you need completely unmodified image data (including, apparently, not performing decompression if it is compressed?) but can’t just hash the entire file?

EXIF data can be modified.

So you would get different hashes for two files with the identical raw data but different EXIF data.

But that is, itself, a fundamental sign of tampering.

Indeed. If data integrity is the goal, why do a checksum of only a part of the file of interest? If you follow the forum long enough, you will find that the prevailing opinion is to leave the raw file unaltered.

As for why the data isn’t consistently extracted, I have the feeling that there may be data offsets and length issues at play; add compression efficiency and types, padding, masking, etc., you have a mess.

My goal isn’t to determine tampering. My goal is to create a unique and reproducible fingerprint of the raw image data for the purposes of organizing large numbers of raw files.

Since I often modify EXIF data to fix the timezones/timestamps, add GPS info, or even add star ratings and copyright information in-camera, etc., a hash of the original file contents isn’t particularly meaningful.

That’s why I want to hash the raw image data.
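A toy illustration of the difference, using made-up byte strings rather than real raw files: the “metadata” prefixes and the fingerprints helper are invented for this sketch, and the only assumption is that the image payload sits at a known offset and length within each file.

```python
import hashlib


def fingerprints(file_bytes, data_offset, data_size):
    """Return (whole-file SHA-256, payload-only SHA-256) as hex strings."""
    whole = hashlib.sha256(file_bytes).hexdigest()
    payload = file_bytes[data_offset:data_offset + data_size]
    return whole, hashlib.sha256(payload).hexdigest()


# Synthetic stand-ins: same "pixels", different "metadata" in front of them.
pixels = bytes(range(256)) * 4
header_a = b"EXIF:rating=0;"
header_b = b"EXIF:rating=5;gps=1;"
original = header_a + pixels
edited = header_b + pixels

whole_a, payload_a = fingerprints(original, len(header_a), len(pixels))
whole_b, payload_b = fingerprints(edited, len(header_b), len(pixels))

print(whole_a == whole_b)      # False: the metadata edit changes the file hash
print(payload_a == payload_b)  # True: the payload hash ignores the metadata
```

Note that the edit moved the payload to a different offset, so the offset has to be looked up again for each file rather than cached.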

This is what sidecar files are for!