Rawsamples.ch and hosting raw sample files

I’m a little bit rusty with web dev, but I’m pretty sure that is indeed needed for a proper implementation.

It strikes me as important because it would be very helpful in letting the database indicate which files have metadata that cannot be read by mainstream free software tools. One obvious example of this is Magic Lantern-produced RAWs for Canon cameras that don’t otherwise produce them.

On the other hand, a less obvious (and maybe less relevant) example is when RAW processors produce a DNG whose metadata is corrupt. I don’t know whether the new database will host files of the latter type, but I suppose one could argue that they belong. In that case it’s not just a matter of firmware, but of the software and version that produced them, potentially adding yet another couple of fields.

A third example is Android. Last I checked, the metadata in Android RAW files is mostly unusable due to a bug in Google’s RAW support, so the Android version would be useful too.

From Magic Lantern | Home

Supported Cameras:
5D2, 5D3, 6D, 7D, 50D, 60D, 500D/T1i, 550D/T2i, 600D/T3i, 650D/T4i, 700D/T5i, 1100D/T3, EOS M
In progress:
70D, 100D/SL1
Inactive ports (help welcome):
5D classic (old version available)
40D (not working at all)

Which of these cameras doesn’t produce raw files?

You don’t mean when exiv2/exiftool is used to modify the raw file?
If not, then that is still a raw file from the camera, so I’m pretty sure it fits here.

Same here, still a raw file from the camera.

Sorry, my mistake. I should have written CHDK, which produces CR2 and other files whose metadata cannot be read by exiv2 / ExifTool.

There are two issues here:

  1. The submitter modifies the RAW file before uploading it to the raw samples DB, using some tool or other. It definitely happens, and it’s highly annoying.
  2. Some software programs, e.g. DxO OpticsPro, can produce linear DNG files. Sometimes their metadata is corrupt.
$ grep -i canon cameras.xml | grep chdk
        <Camera make="Canon" model="PowerShot SD300" supported="no" mode="chdk">
        <Camera make="Canon" model="PowerShot A460" supported="no" mode="chdk">
        <Camera make="Canon" model="PowerShot A610" mode="chdk">
        <Camera make="Canon" model="PowerShot A530" supported="no" mode="chdk">
        <Camera make="Canon" model="PowerShot S3 IS" mode="chdk">
        <Camera make="Canon" model="PowerShot A620" mode="chdk">
        <Camera make="Canon" model="PowerShot A470" supported="no" mode="chdk">
        <Camera make="Canon" model="PowerShot A720 IS" mode="chdk">
        <Camera make="Canon" model="PowerShot A630" mode="chdk">
        <Camera make="Canon" model="PowerShot A640" mode="chdk">
        <Camera make="Canon" model="PowerShot A650" mode="chdk">
        <Camera make="Canon" model="PowerShot SX110 IS" mode="chdk">
        <Camera make="Canon" model="PowerShot SX120 IS" supported="no" mode="chdk">
        <Camera make="Canon" model="PowerShot SX20 IS" supported="no" mode="chdk">
        <Camera make="Canon" model="PowerShot SX220 HS" supported="no" mode="chdk">
        <Camera make="Canon" model="PowerShot SX30 IS" supported="no" mode="chdk">
        <Camera make="Canon" model="PowerShot A3300 IS" supported="no" mode="chdk">

So yes, all these cameras are already in the “samples wanted” list, and are more or less supported by darktable/rawspeed.

Sorry, no sympathy/understanding from me here. The ability to modify raw files is a bug in some other tool. If the samples are modified by the user, then it’s the user’s fault and those samples are just garbage; there is no point in keeping them.

Do note that those are not really cameras (and this is mostly about cameras), but just buggy software. For rawspeed, I’m mostly sure we won’t be adding workarounds to keep intentionally broken (read: edited) raw files working. That being said, even darktable’s “create HDR DNG” functionality still produces half-broken DNGs (no white balance info).

I’ve analyzed every file in the raw sample DB. Last I checked, it is impossible to get metadata out of those CHDK files using mainstream free software tools. Maybe that’s not important to you as someone interested primarily in rawspeed, but it is important to me, and maybe to others too. Personally I find them useful as samples of garbage input that my program should handle gracefully. So I think it would be very useful for the DB to differentiate between categories of RAW.

Then they should be purged, and the new DB should make it very clear to uploaders that they should never submit user-modified RAW files. The existing raw samples site did not do that, last I checked.

It’s not clear to me whether the new site should host all RAW files or just RAW files straight out of the camera; that’s why I bring up that example. Personally I don’t mind either way, but it’s something to keep in mind – I imagine RAW samples generated by software programs might be useful to some free software developers.

Good point! Even corrupted raw files are good for test purposes.

True. Though it is simple-ish to generate such files by corrupting proper raws (also, fuzzing).
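For illustration, a crude way to produce such a test file is to clobber a few bytes in a copy of a known-good raw; the filename and offsets below are arbitrary placeholders:

$ cp IMG_0001.CR2 corrupted.CR2
$ # overwrite 64 bytes at an arbitrary offset with random data, without truncating the file
$ dd if=/dev/urandom of=corrupted.CR2 bs=1 count=64 seek=4096 conv=notrunc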

Yep, I did state that in my first comment already.

Now that is a valid, important point.

It’s not really up to me to decide that. I can only say (and already did say) what samples I want available.

PS: it’s raw, not RAW.

I’m also in a discussion with @LebedevRI about a regression testing tool for rawspeed. I am wondering: has anyone tried to contact Jakob Rohrbach from rawsamples.ch and ask him about the current (and expected future) state of his site? If so: what did he have to say about it?

1 Like

I agree that we need (it would be great to have) a source for raw files.

But why do we need it?

There are technical reasons common to all raw development software, like getting proper black and white levels and raw crops, for example.
The same goes for raw files used to check correct decoding (corrupted raw files, manufacturers introducing new compression formats, etc.).

But there are also differences in requirements:

DT needs to get some data for raw denoise, IIRC (please correct me if I’m wrong here).

RT needs to get proper white levels from raw files scaled by ISO and aperture (which means even more raw samples). In fact it does not need this, but it supports it!

PF needs… (I don’t know, @Carmelo_DrRaw)

Maybe there are more differences than the ones mentioned above.

Doh, now we have the first difference. That means we have to make a superset of all requirements.

I’m absolutely not against that.
I just want to mention it, and that in this case a cross-reference to cameras.xml may not be enough to fulfill the requirements.

PhF is “stealing” the RAW decoding from DT and RT, so it does not have additional requirements so far…

All these “different requirements” are just different samples, per camera.
The camera list is still the same for every program, because there is a fixed list of produced cameras.
So I completely fail to see why it would not be possible to generate the camera list from cameras.xml.

Edit: yes, cameras.xml does not list all the cameras ever produced, so additional sources may be used, like RT’s camconst.json(?); and there should probably be a way to select “sample for camera not in the list” during upload. The camera maker and model can be trivially extracted from Exif/Makernotes.
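As a rough sketch of both points (the exact tools are just one option, and sample.cr2 is a placeholder file):

$ # list the make/model pairs already known to cameras.xml
$ grep -o 'make="[^"]*" model="[^"]*"' cameras.xml | sort -u
$ # extract maker and model from an uploaded sample's Exif
$ exiftool -s -Make -Model sample.cr2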

Why tarballs… people aren’t going to download hundreds of small files, and the files are already in a compressed format. A hierarchical directory would be enough (people would just download a subtree…). With Unix links you can have several different hierarchies without duplicating the files.
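For example, one way to offer both a by-make and a by-format view of the same files with symlinks (the directory layout here is hypothetical):

$ mkdir -p by-make/Canon/EOS_7D by-format/CR2
$ # the same sample appears in both hierarchies without duplicating the data
$ ln -s ../../by-make/Canon/EOS_7D/sample.cr2 by-format/CR2/Canon_EOS_7D_sample.cr2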

Something to take into account is network bandwidth. Of course DigitalOcean isn’t really metering bandwidth (if we put this on DO), and this may not be a very popular site, but it is still a pretty big repository. So it would be good to have some way of limiting bandwidth for anonymous users (those using HTTP/FTP, for instance) and a speedier method for identified people (SCP/SFTP using SSH keys, for instance).

A tarball, unlike zip etc., is just an archive file, which can then optionally be compressed with a separate tool. A tarball does not imply that it is additionally compressed. I do agree that in this case there is likely no point in additional compression.

This actually seems like an argument in favor of a tarball. Last time I checked, downloading a lot of small files is never faster than downloading one big archive containing them. And not easier.

From a CI point of view, the complete tarball should have a “last changed” timestamp (next to it?), which can be checked; if it does not match the locally cached version, the tarball is downloaded and cached locally. With lots of small files, that would be more complicated, I imagine.
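A minimal sketch of that check with curl, assuming the server sends a usable Last-Modified header (the tarball URL is a placeholder):

$ # -R keeps the remote timestamp on the local copy; --time-cond re-downloads only if the remote file is newer
$ curl -R --time-cond raw-samples.tar -o raw-samples.tar https://example.org/raw-samples.tar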

Edit: or maybe git-lfs is the middle ground here; I don’t know, I haven’t used it before. But even then, for CI there is no interest in all the samples, only in the unique ones. So naturally a tarball seems like the simplest choice.

I just wanted everyone to know that I am following this conversation carefully but am currently traveling. I will be posting later (maybe tomorrow), so please don’t think I’ve gone missing. :slight_smile:

I think you misunderstood me… and we agree. Raw files are typically over 10 MB, so you don’t download that many of them. Tarballs make a big difference when you download hundreds of small files, like source code…

@LebedevRI needs all of the files present for automated testing. While it is wasteful to re-download a few GB just because one single image was added, it is much easier to script.

I can use your cameras.xml as a basis for the required list, I think.

Do you mean to suggest that each CI build or test will have to download the entire tarball? This might be a little restrictive given the probable size of a final tarball. Depending on how you’re implementing CI (Travis?), it might make more sense to cache these on the build system? For instance, I cache npm modules and the build state of pixls.us between CI builds - this might make sense for what you want to do?

Couldn’t agree more.

[quote=“LebedevRI, post:5, topic:2882”]
There needs to be a very simple and friendly web-ui for sample submission.
[/quote]

Absolutely agree.

Fair enough! :slight_smile:

Good point. I will write an email to him tonight to find an answer to these questions. I certainly don’t want to duplicate effort if it can be avoided.

Agreed, though right now I’d like to try to solve the (harder, IMO) problem of getting folks engaged enough to help and upload files. We can certainly revisit this.

Yes, I figure we’ll be looking at some large datasets, but the upside is that transfers should be relatively infrequent. That is, I’d be surprised to find folks downloading everything multiple times a day. :wink:

My bigger concern here is to take these types of problems off the plates of the smarter folks who are doing the programming and creating software. I think we can handle the infrastructure so y’all (I am in the southern US) don’t have to worry about it. It should _just work_™ for you. :smiley:

At the moment I am trying to break this into some manageable tasks once we figure out some architecture answers. I’d like to keep the site and interaction static if possible, and I think we might be able to do this with the infrastructure we already have in place. It will require some work though.

In basic terms, we need:

  1. An upload mechanism that is simple and low-effort for someone to participate and upload a file.
    Ideally, I’m envisioning a user being able to see a missing make/model with a “+” or “upload” button right there, letting them pick and upload a file directly to our infrastructure.
  2. A mechanism for extracting information that we may need to sort/filter with.
  3. A means of displaying the information and allowing downloads.

I’ve taken a quick look at AWS Lambda as suggested by @jinxos, which looks like it might be a nice fit. I can build a POST form to push the upload into a bucket, and we can trigger a Lambda function to finish the processing for us (including pushing the necessary files to support 3).
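Whatever ends up running it (Lambda or otherwise), the post-upload processing could be as small as this sketch; the bucket names and extracted fields are placeholders:

$ # grab the freshly uploaded sample
$ aws s3 cp s3://raw-samples-uploads/incoming/sample.cr2 .
$ # pull out the fields we may want to sort/filter on (step 2)
$ exiftool -json -Make -Model -Software sample.cr2 > sample.json
$ # publish the metadata where the static site can read it (step 3)
$ aws s3 cp sample.json s3://raw-samples-public/metadata/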

I’ll be back with more when I get some time to write down my thoughts. Getting late here.

1 Like

Awesome! :slight_smile:

Yes, and yes, I understand that. So far I do not even know whether doing it on Travis-CI will work (file size limit?), or whether it will be too slow.

Yes, I certainly plan on trying to cache that. Or maybe Travis-CI won’t work for this, and we’ll need something with more direct control, as proposed by @andabata. But I will try to make Travis-CI work.

FWIW, I am certainly NOT interested in running a sample for each ISO/aperture/etc. through CI, and probably not that interested in garbage samples either. That should limit the tarball size significantly. :slight_smile:

I’ve been thinking about how to make things more redundant/resilient, and was speaking with @andabata on IRC about making a backup with git/git-annex. Git-annex can scrape the data via RSS or WebDAV or any number of things. Then a special remote can be used to duplicate the data elsewhere.
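A rough sketch of that idea, assuming the samples end up exposed via an RSS feed and an rsync-reachable mirror (all URLs and remote settings below are placeholders):

$ git init raw-samples-backup && cd raw-samples-backup
$ git annex init backup
$ # pull every enclosure referenced by the feed into the annex
$ git annex importfeed https://example.org/raw-samples.rss
$ # define a special remote and push a second copy of the data there
$ git annex initremote mirror type=rsync rsyncurl=backup.example.org:/srv/raw-samples encryption=none
$ git annex copy --to=mirror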

1 Like