Rawsamples.ch and hosting raw sample files

Awesome! :slight_smile:

Yes, and yes, I understand that. So far I do not even know whether doing it on Travis-CI will work (file size limit?), or whether it will be too slow.

Yes, I certainly plan on trying to cache that. Or maybe Travis-CI won’t work for this and we’ll need something with more direct control, as proposed by @andabata. But I will try to make Travis-CI work.

FWIW, I am certainly NOT interested in running a sample for each ISO/aperture/etc. through CI. And probably not so interested in garbage samples either. That should limit the tarball size significantly :slight_smile:

I’ve been thinking about how to make things more redundant/resilient and was speaking with @andabata on IRC about making a backup with git/git-annex. Git-annex can scrape the data via RSS, WebDAV, or any number of other things. Then a special remote can be used to duplicate the data elsewhere.


Congratulations to the new upload site: looks pretty good. I have a couple of suggestions, though:

  1. add some kind of upload timestamp to the table, so I can sort by that and see what is new.
  2. ideal would also be some rsync type of access to the raw files so I don’t have to download the whole 7GB all the time.
  3. consider some automatic file renaming after upload, based on Exif data, that makes it easier to identify interesting files when working with parts of the complete set locally. My suggestion: Make-Model-Mode-Number.ext, where Number just counts upwards for all files where Make-Model-Mode is the same. Spaces and invalid characters, like ‘:’, should be replaced with underscores. E.g. Canon-Canon_EOS_5D_Mark_III-mRaw-0.cr2.
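The renaming scheme in point 3 can be sketched in a few lines of Python. This is only an illustration of the proposed convention, not anything r.p.u actually runs; the function names and the exact set of replaced characters are my assumptions:

```python
import re

def sanitize(part: str) -> str:
    """Replace spaces and characters that are invalid on common
    filesystems (e.g. ':' on FAT) with underscores. The character
    set here is an assumption, not an official list."""
    return re.sub(r'[ :*?"<>|/\\]', "_", part)

def sample_name(make: str, model: str, mode: str, number: int, ext: str) -> str:
    """Build a Make-Model-Mode-Number.ext name from Exif-derived fields."""
    return "-".join(sanitize(p) for p in (make, model, mode)) + f"-{number}.{ext}"

print(sample_name("Canon", "Canon EOS 5D Mark III", "mRaw", 0, "cr2"))
# → Canon-Canon_EOS_5D_Mark_III-mRaw-0.cr2
```

The same `sanitize` step would also take care of the ‘:’ characters that break FAT file systems, mentioned further down in the thread.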

Can be added; just create an issue on GitHub.

rsync would be an issue, given the way the data is stored (hashed based on the id it gets in the db). But other means of access are always an option. I can always create something which outputs in a way you can process (xml/rss/csv).

If you would use something from the previous point, I can do the renaming on the fly. I won’t rename the stored files, but I can change the name when serving a download.

Thanks for the new method for file synchronization! Very helpful.

There is just one issue I am having right now regarding file naming: the two GITUP-GIT2-* files contain ‘:’ in their name, which prevents me from storing them on a FAT file system. Would it be possible to make sure no filename contains that kind of invalid character, even if that meant you’d have to rename them?

There is one more thing I am having an issue with: the size of the set ;), which went up from 7.x to 9.x GB in just two weeks. That is going to be fun…

Looking for ‘big’ files, I noticed there are 4 files that got imported from rawsamples.ch that are in fact not raw files at all:

  • RAW_HASSELBLAD_CFV.PPM
  • RAW_NIKON_D800_*.TIFF

They are of no value in this context. Please remove them.

Also there are 3 ‘old’ D800 files still present. They are all 14bit uncompressed, same as D800-14b-no-comp.NEF. The only difference is that they are digitally cropped to different sizes within the camera. From a rawspeed testing perspective they are all equivalent. For all I care, they should be removed as well. Maybe @LebedevRI would have some input regarding those 3 files?

By my estimates, we have samples for ~half of the cameras rawspeed supports/knows about. So the size will at least double, hopefully.

Yep, deleted that.

Hmm. They are raw files. Produced by camera. And, currently equivalently broken in rawspeed:

$ darktable-rs-identify RAW_NIKON_D800_L.TIFF 
Loading file: "RAW_NIKON_D800_L.TIFF"
ERROR: [rawspeed] NEF Decoder: Unable to locate image
$ darktable-rs-identify RAW_NIKON_D800_M.TIFF
Loading file: "RAW_NIKON_D800_M.TIFF"
ERROR: [rawspeed] NEF Decoder: Unable to locate image
$ darktable-rs-identify RAW_NIKON_D800_S.TIFF
Loading file: "RAW_NIKON_D800_S.TIFF"
ERROR: [rawspeed] NEF Decoder: Unable to locate image

Maybe you are right and they came right out of the camera like this. But according to my definition they are still not raw files, no more than JPEGs coming from a camera are raw files. Those TIFFs contain 8bit RGB data for every pixel. You can open them in any standard image viewer just as they are.

@LebedevRI: I just noticed that the TIFF files are gone now. Thanks. This still leaves the ‘old’ uncompressed 14bit files with different crops to be discussed. Do you see any value for them in terms of testing librawspeed?

@axxel

^ are you talking about the same files there?

No, I’m talking about the RAW_NIKON_D800_14bit_*.NEF files.

Ah.
Well, those are certainly not plain LDR :slight_smile:
They are currently loaded via rawspeed (in dt), and do contain raw mosaic.
So these (and similar raws from other cameras) are staying.

The fact that they currently work is just dumb luck:
e.g. I know that in rawspeed, support for some Nikon uncompressed raws is (was?) broken.
But some uncompressed NEFs work. So all raw samples are wanted.

They do contain CFA raw data (I never said otherwise). I said the only difference (that I know of) between them and the new D800-14b-no-comp.NEF is that they are cropped in camera. So width/height are different while all codec-related properties seem the same. Hence, I see no value for regression testing, or maybe as little value as different ISO samples would have. I have not checked whether they execute the exact same code paths, though.

TLDR: slight sample overlap is so much better than partial sample coverage.

Just like Panasonic and Leica aspect ratios: not all of them, but some of them actually contain a different raw crop. The fact that they currently work or don’t work is irrelevant. Nowadays they may be the same as normal raws from these cameras, and in the next camera they may be completely different. That is just luck.

And sometimes, manufacturers (notably: Panasonic, Canon XXXXD/XXXD) even release (supposedly) the same camera under up to three different names. So everything is the same but the camera name.

No matter whether we like it or not, r.p.u’s purpose is NOT to be just a source of different raw samples for rawspeed testing with zero overlap. It is more generic than that.
Yes, it absolutely should have all the samples needed for rawspeed testing. But it absolutely should also keep other samples, even ones that slightly duplicate existing samples (from a testing perspective).

If we decided (we won’t) to deduplicate such things, it would be bound to blow up in our faces: we would never know that some of them stopped working should the manufacturers decide to change something again.

Alright. I’ll look into some other means of reducing my local set to the minimum that still provides maximum coverage (as already discussed on GitHub a while back). Well, iff I find the time and it turns out to be worth the effort.

But in the Nikon example you would then end up with the cross product of bit depth × compression × crop, which would be 2 × 3 × 4 = 24 files for one camera. A few extra files here and there might be one thing, but blowing it up like this seems serious enough to think about what is really helpful and what is not; like only saving enough samples so that each property is covered at least once, but not the full set of combinations. Maybe the combination of bit depth vs. compression is helpful (I have not looked at the code). Just saying.
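The difference between “full cross product” and “each property covered at least once” can be made concrete with a tiny enumeration. The property values below are hypothetical stand-ins for the Nikon example, not an actual list of D800 modes:

```python
from itertools import product

# Hypothetical property values for the Nikon example above.
bit_depths   = ["12bit", "14bit"]
compressions = ["lossless", "lossy", "uncompressed"]
crops        = ["full", "L", "M", "S"]

# Full cross product: every combination as its own sample file.
full_set = list(product(bit_depths, compressions, crops))
print(len(full_set))  # 2 * 3 * 4 = 24

# Minimal covering set: each individual property value appears at least
# once; cycling through the value lists needs only max(2, 3, 4) = 4 samples.
covering = [
    (bit_depths[i % len(bit_depths)],
     compressions[i % len(compressions)],
     crops[i % len(crops)])
    for i in range(max(len(bit_depths), len(compressions), len(crops)))
]
print(len(covering))  # 4
```

So a pure “cover every value once” policy would shrink 24 samples to 4, at the cost of never exercising the untested combinations, which is exactly the trade-off being argued about here.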

FWIW, what I’m saying here is that we will not be intentionally deleting them just because.
However, we are not asking for new Nikon samples with a different crop, as you can see under Of the following brands we like to have: on https://raw.pixls.us/

Like I said in the previous comment, I don’t think that is up to r.p.u to decide.

Right, I did not notice that.

I just noticed something else, though: rerunning lftp, I get tons of lines like the following:

Removing old file `Nikon/1 J2/DSC_0451.NEF'                               
Transferring file `Nikon/1 J2/DSC_0451.NEF'

It seems that basically all files get re-downloaded. Maybe the creation/modification timestamps all got accidentally updated somehow? Am I the only one having this issue?

The data dir gets rebuilt once a day. To build it, I use the cameras.xml from rawspeed to get the normalized naming of makes/models. To reflect updates, I just rebuild the data dir. Though the file dates remain the same; only the directory dates change. I suppose you can tweak lftp to only mirror if the size is different.
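The normalization step described above boils down to looking up make/model pairs in cameras.xml. A minimal sketch, assuming `<Camera>` elements carry `make` and `model` attributes; the inline XML here is a hypothetical stand-in for the real file, which has far more entries and attributes:

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for rawspeed's cameras.xml (hypothetical content).
CAMERAS_XML = """
<Cameras>
  <Camera make="NIKON CORPORATION" model="NIKON D800"/>
  <Camera make="Canon" model="Canon EOS 5D Mark III" mode="mRaw"/>
</Cameras>
"""

def normalized_names(xml_text: str):
    """Extract (make, model) pairs, i.e. the normalized naming
    used to lay out the per-camera directories."""
    root = ET.fromstring(xml_text)
    return [(c.get("make"), c.get("model")) for c in root.findall("Camera")]

for make, model in normalized_names(CAMERAS_XML):
    print(f"{make}/{model}")
```

Rebuilding the directory tree from this lookup (rather than renaming stored files) is consistent with the earlier point that files keep their dates while directory dates change on each rebuild.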

Works for me with `lftp -c mirror -c -P 4 -v --delete`.
You said you have an old lftp version. Perhaps that is the reason.
Or the storage fs? :slight_smile:


For what it is worth: `--ignore-time` does help; getting the newest lftp does not. The only explanation I have is indeed the ‘ugly’ fs (FAT32) ;). My space constraints leave me no other choice, though.

By the way: I therefore still can’t store the two GITUP files with ‘:’ in their filenames.