Rawsamples.ch and hosting raw sample files

I’m also in a discussion with @LebedevRI about a regression testing tool for rawspeed. I am wondering: has anyone tried to contact Jakob Rohrbach from rawsamples.ch and ask him about the current (and expected future) state of his site? If so: what did he have to say about it?

I agree that we need (it would be great to have) a source for raw files.

But why do we need it?

There are technical reasons common to all raw development software, like getting proper black and white levels and raw crops, for example.
The same goes for raw files used to check correct decoding (corrupted raw files, manufacturers introducing new compression formats, etc.).

But there are also differences in requirements:

DT needs some sample data for raw denoise, IIRC (please correct me if I’m wrong here)

RT needs proper white levels from raw files, scaled by ISO and aperture (which means even more raw samples). Strictly speaking it does not need them, but it supports them!

PF needs (don’t know, @Carmelo_DrRaw )

Maybe there are more differences than the ones mentioned above.

Doh, now we have the first difference. That means we have to make a superset of all requirements.

I’m absolutely not against that.
I just want to point out that, in this case, a cross-reference to cameras.xml may not be enough to fulfill all the requirements.

PhF is “stealing” the RAW decoding from DT and RT, so it does not have additional requirements so far…

All these “different requirements” are just different samples per camera.
The camera list is still the same for every program, because there is a fixed list of cameras that have been produced.
So I completely fail to see why it would not be possible to generate the camera list from cameras.xml.

Edit: yes, cameras.xml does not list all the cameras ever produced, so additional sources may be used, like RT’s camconst.json(?); and there should probably be a way to select “sample for camera not in the list” during upload. The camera maker and model can be trivially extracted from Exif/Makernotes.
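
For example, assuming exiftool is available wherever the uploads get processed (the file name below is just a placeholder), pulling those two fields out of a sample is a one-liner:

$ exiftool -s3 -Make -Model RAW_SAMPLE.NEF

(-s3 prints the bare tag values, one per line.)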

Why tarballs… people aren’t going to download hundreds of small files, and the files are already in a compressed format. A hierarchical directory would be enough (people would just download a subtree…). With Unix links you can have several different hierarchies without duplicating the files.
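
To illustrate the idea (all paths here are made up): each file lives exactly once in a flat store, and every alternative view (by maker, for instance) is just a tree of symlinks, so downloading a subtree fetches each file only once.

$ mkdir -p store by-maker/Canon
$ mv some-canon-sample.cr2 store/            # the file is stored exactly once
$ ln -s ../../store/some-canon-sample.cr2 by-maker/Canon/
$ rsync -avL server:/samples/by-maker/Canon/ ./Canon/   # -L dereferences the links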

Something to take into account is network bandwidth. Of course DigitalOcean isn’t really metering bandwidth (if we put this on DO), and this may not be a very popular site, but it is still a pretty big repository. So it would be good to have some way to limit bandwidth for anonymous users (those using HTTP/FTP, for instance) and a speedier method for identified people (SCP/SFTP using SSH keys, for instance).

A tarball, unlike zip etc., is just an archive file, which can then optionally be compressed with a separate tool; a tarball does not imply additional compression. I do agree that in this case there is likely no point in compressing it further, since the files are already in a compressed format.
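
In concrete terms (the names are illustrative), that just means skipping the usual gzip step:

$ tar -cf raw-samples.tar samples/       # plain archive, no compression
$ tar -czf raw-samples.tar.gz samples/   # the step we would skip; gzip gains almost nothing on raws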

This actually seems like an argument in favor of a tarball. Last time I checked, downloading lots of small files is never faster than downloading one big archive containing those files. And it is not easier, either.

From a CI point of view, the complete tarball should have a “last changed” timestamp (next to it?) that can be checked; if it does not match the locally cached version, the tarball gets re-downloaded and cached locally. With lots of small files, that would be more complicated, I imagine.
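
A minimal sketch of that check, assuming the timestamp is published as a small file next to the tarball (the URL and cache path below are made up):

BASE=https://example.org/raw-samples         # assumed location of tarball + timestamp
CACHE="$HOME/.cache/raw-samples"
mkdir -p "$CACHE"
curl -sSL "$BASE/raw-samples.tar.timestamp" -o "$CACHE/timestamp.new"
if ! cmp -s "$CACHE/timestamp.new" "$CACHE/timestamp"; then
    # timestamp changed (or first run): re-download and unpack the tarball
    curl -sSL "$BASE/raw-samples.tar" -o "$CACHE/raw-samples.tar"
    rm -rf "$CACHE/unpacked" && mkdir -p "$CACHE/unpacked"
    tar -xf "$CACHE/raw-samples.tar" -C "$CACHE/unpacked"
    mv "$CACHE/timestamp.new" "$CACHE/timestamp"
fi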

Edit: or maybe git-lfs is the middle ground here; I don’t know, I have not used it before. But even then, CI has no interest in all the samples, only in the unique ones. So naturally a tarball seems like the simplest choice.

I just wanted everyone to know that I am following this conversation carefully but am currently traveling. I will be posting later (maybe tomorrow), so please don’t think I’ve gone missing. :slight_smile:

I think you misunderstood me… and we agree. Raw files are typically over 10 MB, so you don’t download that many of them. Tarballs make a big difference when you download hundreds of small files, like source code…

@LebedevRI needs all of the files present for automated testing. While it is wasteful to re-download a few GB just because one single image was added, it is much easier to script.

I can use your cameras.xml as a basis for the required list, I think.

Do you mean to suggest that each CI build or test will have to download the entire tarball? This might be a little restrictive given the probable size of a final tarball. Depending on how you’re implementing CI (Travis?), it might make more sense to cache these on the build system? For instance, I cache npm modules and the build state of pixls.us between CI builds - this might make sense for what you want to do?

Couldn’t agree more.

[quote=“LebedevRI, post:5, topic:2882”]
There needs to be a very simple and friendly web-ui for sample submission.
[/quote]

Absolutely agree.

Fair enough! :slight_smile:

Good point. I will write an email to him tonight to find an answer to these questions. I certainly don’t want to duplicate effort if it can be avoided.

Agreed, though right now I’d like to try and solve the (harder imo) problem of getting folks engaged enough to help and upload files. We can certainly re-visit this.

Yes, I figure we’ll be looking at some large datasets, but the upside is that there will likely be relatively few transfers. That is, I’d be surprised to find folks downloading everything multiple times in a day. :wink:

My bigger concern here is to take these types of problems off the plates of the smarter folks who are doing the programming and creating the software. I think we can handle the infrastructure so y’all (I am in the southern US) don’t have to worry about it. It should _just work_™ for you. :smiley:

At the moment I am trying to break this into some manageable tasks once we figure out some architecture answers. I’d like to keep the site and interaction static if possible, and I think we might be able to do this with the infrastructure we already have in place. It will require some work though.

In basic terms, we need:

  1. An upload mechanism that is simple and low-effort for someone to participate and upload a file.
    Ideally, I’m envisioning a user being able to see a missing make/model with a “+” or “upload” button right there, letting them pick and upload a file directly to our infrastructure.
  2. A mechanism for extracting information that we may need to sort/filter with.
  3. A means of displaying the information and allowing downloads.

I’ve taken a quick look at AWS Lambda, as suggested by @jinxos, and it looks like it might be a nice fit. I can build a POST form to push the upload into a bucket, and we can trigger a Lambda function to finish the processing for us (including pushing the files necessary to support 3).

I’ll be back with more when I get some time to write down my thoughts. Getting late here.

Awesome! :slight_smile:

Yes, and yes, I understand that. So far I do not even know whether doing it on Travis-CI will work at all (file size limit?), or whether it will be too slow.

Yes, I certainly plan on trying to cache that. Or maybe Travis-CI won’t work for this and we’ll need something with more direct control, as proposed by @andabata. But I will try to make Travis-CI work.

FWIW, I am certainly NOT interested in running a sample for each ISO/aperture/etc. through CI, and probably not interested in garbage samples either. That should limit the tarball size significantly :slight_smile:

I’ve been thinking about how to make things more redundant/resilient, and was speaking with @andabata on IRC about making a backup with git/git-annex. git-annex can scrape the data via RSS or WebDAV or any number of other things. Then a special remote can be used to duplicate the data elsewhere.
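
Roughly, and assuming the site publishes an RSS feed of the uploads (the feed URL and remote settings below are placeholders), that could look like:

$ git init raw-samples-backup && cd raw-samples-backup
$ git annex init backup
$ git annex importfeed https://example.org/uploads.rss      # pull in everything linked from the feed
$ git annex initremote offsite type=rsync rsyncurl=backuphost:/srv/raw-samples encryption=none
$ git annex copy --to offsite                               # mirror the data to the special remote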

Congratulations on the new upload site: it looks pretty good. I have a couple of suggestions, though:

  1. add some kind of upload timestamp to the table, so I can sort by that and see what is new.
  2. some rsync-type access to the raw files would also be ideal, so I don’t have to download the whole 7 GB every time.
  3. consider some automatic file renaming after upload, based on Exif data, to make it easier to identify interesting files when working locally with parts of the complete set. My suggestion: Make-Model-Mode-Number.ext, where Number just counts upwards for all files where Make-Model-Mode is the same. Spaces and invalid characters, like ‘:’, should be replaced with underscores. E.g. Canon-Canon_EOS_5D_Mark_III-mRaw-0.cr2. (A rough sketch of this follows below.)
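
A minimal sketch of that renaming scheme, assuming exiftool is available on the server; “Mode” has no universal Exif tag, so it is omitted here, and the counter is a simple global one rather than per Make-Model group:

i=0
for f in *.cr2 *.CR2 *.nef *.NEF; do
    [ -e "$f" ] || continue
    make=$(exiftool -s3 -Make "$f")
    model=$(exiftool -s3 -Model "$f")
    ext=$(printf '%s' "${f##*.}" | tr 'A-Z' 'a-z')
    # replace spaces and ':' (invalid on FAT) with underscores
    base=$(printf '%s-%s-%d' "$make" "$model" "$i" | tr ' :' '__')
    echo mv -- "$f" "$base.$ext"    # dry run; drop the echo to actually rename
    i=$((i+1))
done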

That can be added; just create an issue on GitHub.

rsync would be an issue with the way the data is stored (hashed based on the id it gets in the db). But other means of access are always an option. I can always create something which produces output in a format you can process (XML/RSS/CSV).

If you use something from the previous point, I can do the renaming on the fly. I won’t rename the stored files, but I can change the name when serving a download.

Thanks for the new method for file synchronization! Very helpful.

There is just one issue I am having right now regarding file naming: the two GITUP-GIT2-* files contain ‘:’ in their names, which prevents me from storing them on a FAT file system. Would it be possible to make sure no filename contains that kind of invalid character, even if it meant you’d have to rename them?

There is one more thing I am having an issue with: the size of the set ;), which went up from 7.x to 9.x GB in just two weeks. That is going to be fun…

Looking for ‘big’ files, I noticed there are 4 files imported from rawsamples.ch that are in fact not raw files at all:

  • RAW_HASSELBLAD_CFV.PPM
  • RAW_NIKON_D800_*.TIFF

They are of no value in this context. Please remove them.

Also, there are 3 ‘old’ D800 files still present. They are all 14-bit uncompressed, the same as D800-14b-no-comp.NEF. The only difference is that they are digitally cropped to different sizes within the camera. From a rawspeed testing perspective they are all equivalent. For all I care, they should be removed as well. Maybe @LebedevRI has some input regarding those 3 files?

By my estimate, we have samples for ~half of the cameras rawspeed supports/knows about, so hopefully the size will at least double.

Yep, deleted that.

Hmm. They are raw files, produced by the camera. And currently they are equally broken in rawspeed:

$ darktable-rs-identify RAW_NIKON_D800_L.TIFF 
Loading file: "RAW_NIKON_D800_L.TIFF"
ERROR: [rawspeed] NEF Decoder: Unable to locate image
$ darktable-rs-identify RAW_NIKON_D800_M.TIFF
Loading file: "RAW_NIKON_D800_M.TIFF"
ERROR: [rawspeed] NEF Decoder: Unable to locate image
$ darktable-rs-identify RAW_NIKON_D800_S.TIFF
Loading file: "RAW_NIKON_D800_S.TIFF"
ERROR: [rawspeed] NEF Decoder: Unable to locate image

Maybe you are right and they came straight out of the camera like this. But by my definition they are still not raw files, any more than JPEGs coming from a camera are raw files. Those TIFFs contain 8-bit RGB data for every pixel; you can open them in any standard image viewer just as they are.

@LebedevRI: I just noticed that the TIFF files are gone now. Thanks. This still leaves the ‘old’ uncompressed 14-bit files with different crops to be discussed. Do you see any value in them in terms of testing librawspeed?

@axxel

^ Are you talking about the same files there?