Rawsamples.ch and hosting raw sample files

A tarball, unlike zip etc., is just an archive file, which can then optionally be compressed with a separate tool. A tarball does not imply that it is additionally compressed. I do agree that in this case there is likely no point in additional compression.

This actually seems like an argument in favor of a tarball. Last time I checked, downloading lots of small files is never faster than downloading one big archive containing those files. And not easier, either.

From a CI point of view, the complete tarball should have a "last changed" timestamp (next to it?), which can be checked; if it does not match the locally cached version, the tarball is downloaded and cached locally. With lots of small files, that would be more complicated, I imagine.
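For illustration, a minimal sketch of that check in Python, assuming the tarball is served over plain HTTP with a Last-Modified header (the URL and file names are made-up placeholders, not an actual hosting setup):

```python
# Download the tarball only when the server-side "last changed" timestamp
# differs from the one we cached last time. URL/paths are hypothetical.
import os
import urllib.request

TARBALL_URL = "https://example.org/raw-samples.tar"  # placeholder URL
CACHE_PATH = "raw-samples.tar"
STAMP_PATH = "raw-samples.tar.last-modified"

def fetch_tarball() -> str:
    req = urllib.request.Request(TARBALL_URL, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        remote_stamp = resp.headers.get("Last-Modified", "")

    # Same timestamp as last time -> reuse the locally cached tarball.
    if os.path.exists(CACHE_PATH) and os.path.exists(STAMP_PATH):
        with open(STAMP_PATH) as f:
            if f.read() == remote_stamp:
                return CACHE_PATH

    urllib.request.urlretrieve(TARBALL_URL, CACHE_PATH)
    with open(STAMP_PATH, "w") as f:
        f.write(remote_stamp)
    return CACHE_PATH
```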

Edit: or maybe git-lfs is the middle ground here; I don't know, I have not used it before. But even then, CI is not interested in all the samples, only in the unique ones. So naturally a tarball seems like the simplest choice.

I just wanted everyone to know that I am following this conversation carefully but am currently traveling. I will be posting later (maybe tomorrow), so please don't think I've gone missing. :slight_smile:

I think you misunderstood me… and we agree. Raw files are typically over 10 MB, so you don't download that many of them. Tarballs make a big difference when you download hundreds of small files, like source code…

@LebedevRI needs all of the files present for automated testing. While it is wasteful to re-download a few GB just because one single image was added, it is much easier to script.

I can use your cameras.xml as a basis for the required list, I think.

Do you mean to suggest that each CI build or test will have to download the entire tarball? This might be a little restrictive given the probable size of a final tarball. Depending on how you're implementing CI (Travis?), it might make more sense to cache these on the build system? For instance, I cache npm modules and the build state of pixls.us between CI builds - this might make sense for what you want to do?

Couldn't agree more.

[quote="LebedevRI, post:5, topic:2882"]
There needs to be a very simple and friendly web-ui for sample submission.
[/quote]

Absolutely agree.

Fair enough! :slight_smile:

Good point. I will write an email to him tonight to find an answer to these questions. I certainly don't want to duplicate effort if it can be avoided.

Agreed, though right now I'd like to try to solve the (harder, imo) problem of getting folks engaged enough to help and upload files. We can certainly revisit this.

Yes, I figure we'll be looking at some large datasets, but the upside is that there will likely be fewer transfers overall. That is, I'd be surprised to find folks downloading everything multiple times a day. :wink:

My bigger concern here is to take these types of problems away from the smarter folks who are doing the programming and creating software. I think we can handle the infrastructure so y'all (I am in the southern US) don't have to worry about it. It should _just work_™ for you. :smiley:

At the moment I am trying to break this into some manageable tasks once we figure out some architecture answers. I'd like to keep the site and interaction static if possible, and I think we might be able to do this with the infrastructure we already have in place. It will require some work, though.

In basic terms, we need:

  1. An upload mechanism that makes it simple and low-effort for someone to participate and upload a file.
    Ideally, I'm envisioning a user being able to see a missing make/model with a "+" or "upload" button right there. Let them pick and upload a file directly to our infrastructure.
  2. A mechanism for extracting information that we may need to sort/filter with.
  3. A means of displaying the information and allowing downloads.

I've taken a quick look at AWS Lambda as suggested by @jinxos, and it looks like it might be a nice fit. I can build a POST form to push the upload into a bucket, and we can trigger a Lambda function to finish up processing for us (including pushing the necessary files to support point 3).
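To make that concrete, here is a rough sketch of what the Lambda side could look like, assuming the Python runtime with boto3 and an S3 ObjectCreated trigger; the bucket layout and the metadata step are hypothetical, not a finished design:

```python
# Rough sketch of an S3-triggered Lambda handler: fetch the uploaded raw,
# extract some metadata (placeholder), and publish it next to the upload so a
# static site can pick it up. Bucket layout and metadata step are hypothetical.
import json
import os

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # S3 "ObjectCreated" notifications deliver one record per uploaded object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        local_path = os.path.join("/tmp", os.path.basename(key))
        s3.download_file(bucket, key, local_path)

        # Placeholder for step 2: extract make/model/etc. from the raw file.
        metadata = {"original_key": key, "size_bytes": os.path.getsize(local_path)}

        # Publish the extracted info so the static front end (step 3) can use it.
        s3.put_object(
            Bucket=bucket,
            Key=key + ".json",
            Body=json.dumps(metadata),
            ContentType="application/json",
        )
```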

I'll be back with more when I get some time to write down my thoughts. Getting late here.

Awesome! :slight_smile:

Yes, and yes, I understand that. So far I do not even know whether doing it on Travis-CI will work at all (file size limit?), or whether it will be too slow.

Yes, I certainly plan on trying to cache that. Or maybe Travis-CI won't work for this, and we'll need something with more direct control, as proposed by @andabata. But I will try to make Travis-CI work.

FWIW, I am certainly NOT interested in running a sample for each ISO/aperture/etc. through CI. And probably not that interested in garbage samples either. That should limit the tarball size significantly :slight_smile:

I've been thinking about how to make things more redundant/resilient and was speaking with @andabata on IRC about making a backup with git/git-annex. git-annex can scrape the data via RSS or WebDAV or any number of things. Then a special remote can be used to duplicate the data elsewhere.

Congratulations on the new upload site: it looks pretty good. I have a couple of suggestions, though:

  1. Add some kind of upload timestamp to the table, so I can sort by it and see what is new.
  2. Ideally there would also be some rsync-type access to the raw files, so I don't have to download the whole 7 GB every time.
  3. Consider some automatic file renaming after upload, based on Exif data, that makes it easier to identify interesting files when working with parts of the complete set locally. My suggestion: Make-Model-Mode-Number.ext, where Number simply counts upwards over all files sharing the same Make-Model-Mode. Spaces and invalid characters, like ':', should be replaced with underscores. E.g. Canon-Canon_EOS_5D_Mark_III-mRaw-0.cr2. (A rough sketch of this idea follows below.)
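
A rough sketch of what point 3 could look like, assuming exiftool is installed and used to read the tags; the Mode part is left out because there is no single Exif tag for it across makes, so treat this only as an approximation of the suggested scheme:

```python
# Sketch of the suggested Make-Model-Number renaming, reading Make/Model via
# exiftool's JSON output. The "Mode" component is omitted here because it is
# camera-specific; this is an illustration, not the actual implementation.
import json
import re
import subprocess
from collections import defaultdict
from pathlib import Path

def sanitize(value: str) -> str:
    # Replace spaces and anything that is not FAT-safe with underscores.
    return re.sub(r"[^A-Za-z0-9._-]", "_", value)

def rename_all(directory: str) -> None:
    counters = defaultdict(int)
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        out = subprocess.run(
            ["exiftool", "-j", "-Make", "-Model", str(path)],
            capture_output=True, text=True, check=True,
        )
        tags = json.loads(out.stdout)[0]
        stem = f'{sanitize(tags.get("Make", "Unknown"))}-{sanitize(tags.get("Model", "Unknown"))}'
        n = counters[stem]
        counters[stem] += 1
        path.rename(path.with_name(f"{stem}-{n}{path.suffix.lower()}"))
```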

Can be added, just create an issue on github.

rsync would be an issue with the way the data is stored (hashed based on the id it gets in the db). But other means of access are always an option. I can always create something which outputs in a way you can process (xml/rss/csv).

If you use something from the previous point, I can do the renaming on the fly. I won't rename the stored files, but I can change the name when serving a download.
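For what it's worth, renaming only at download time can be done purely via the Content-Disposition header, so nothing on disk changes. A minimal sketch, assuming a Flask 2.x app; stored_path_for() and pretty_name_for() are hypothetical stand-ins for the db lookups:

```python
# Minimal sketch of "rename only when serving": the file keeps its hashed name
# on disk, and only the suggested download filename (Content-Disposition) changes.
# Flask 2.x assumed; the two lookup helpers are hypothetical placeholders.
from flask import Flask, send_file

app = Flask(__name__)

def stored_path_for(sample_id: int) -> str:
    # Placeholder for the real hashed-storage lookup in the db.
    return f"/data/store/{sample_id % 256:02x}/{sample_id}"

def pretty_name_for(sample_id: int) -> str:
    # Placeholder for a name built from Exif data, e.g. Make-Model-Number.ext.
    return "Canon-Canon_EOS_5D_Mark_III-mRaw-0.cr2"

@app.route("/download/<int:sample_id>")
def download(sample_id: int):
    return send_file(
        stored_path_for(sample_id),
        as_attachment=True,
        download_name=pretty_name_for(sample_id),  # sets Content-Disposition only
    )
```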

Thanks for the new method for file synchronization! Very helpful.

There is just one issue I am having right now regarding file naming: the two GITUP-GIT2-* files contain ':' in their names, which prevents me from storing them on a FAT file system. Would it be possible to make sure no filename contains that kind of invalid character, even if that meant you'd have to rename them?

There is one more thing I am having an issue with: the size of the set ;). It went up from 7.x to 9.x GB in just two weeks. That is going to be fun…

Looking for 'big' files, I noticed there are 4 files imported from rawsamples.ch that are in fact not raw files at all:

  • RAW_HASSELBLAD_CFV.PPM
  • RAW_NIKON_D800_*.TIFF

They are of no value in this context. Please remove them.

Also, there are 3 'old' D800 files still present. They are all 14-bit uncompressed, the same as D800-14b-no-comp.NEF. The only difference is that they are digitally cropped to different sizes in-camera. From a rawspeed testing perspective they are all equivalent. For all I care, they should be removed as well. Maybe @LebedevRI has some input regarding those 3 files?

By my estimates, we have samples for ~half of the cameras rawspeed supports/knows about. So the size will at least double, hopefully.

Yep, deleted that.

Hmm. They are raw files. Produced by the camera. And currently equally broken in rawspeed:

$ darktable-rs-identify RAW_NIKON_D800_L.TIFF 
Loading file: "RAW_NIKON_D800_L.TIFF"
ERROR: [rawspeed] NEF Decoder: Unable to locate image
$ darktable-rs-identify RAW_NIKON_D800_M.TIFF
Loading file: "RAW_NIKON_D800_M.TIFF"
ERROR: [rawspeed] NEF Decoder: Unable to locate image
$ darktable-rs-identify RAW_NIKON_D800_S.TIFF
Loading file: "RAW_NIKON_D800_S.TIFF"
ERROR: [rawspeed] NEF Decoder: Unable to locate image

Maybe you are right and they came straight out of the camera like this. But by my definition they are still not raw files, no more than JPEGs coming from a camera are raw files. Those TIFFs contain 8-bit RGB data for every pixel. You can open them in any standard image viewer just as they are.

@LebedevRI: I just noticed that the TIFF files are gone now. Thanks. This still leaves the 'old' uncompressed 14-bit files with different crops to be discussed. Do you see any value in them in terms of testing librawspeed?

@axxel

^ Are you talking about the same files there?

No, I'm talking about the RAW_NIKON_D800_14bit_*.NEF files.

Ah.
Well, those are certainly not plain LDR :slight_smile:
They are currently loaded via rawspeed (in dt), and do contain a raw mosaic.
So these (and similar raws from other cameras) are staying.

The fact that they currently work is just dumb luck:
E.g. I know that in rawspeed, support for some Nikon uncompressed raws is (was?) broken.
But some uncompressed NEFs work. So all raw samples are wanted.

They do contain CFA raw data (I never said otherwise). I said the only difference (that I know of) between them and the new D800-14b-no-comp.NEF is that they are cropped in-camera. So width/height are different, while all codec-related properties seem the same. Hence, I see no value for regression testing, or at most as little value as different ISO samples would have. I have not checked whether they execute the exact same code paths, though.

TLDR: slight sample overlap is so much better than partial sample coverage.

Just like Panasonic and Leica aspect ratios - not all of them, but some of them actually contain a different raw crop. The fact that they work or don't work right now is irrelevant. Today they may be the same as normal raws from these cameras, and in the next camera they may be completely different. That is just luck.

And sometimes manufacturers (notably: Panasonic, Canon XXXXD/XXXD) even release (supposedly) the same camera under up to three different names. So everything is the same but the camera name.

Whether we like it or not, r.p.u's purpose is NOT to be just a source of distinct raw samples solely for rawspeed testing with zero overlap. It is more generic than that.
Yes, it absolutely should have all the samples needed for rawspeed testing. But it absolutely should not exclude other samples which slightly duplicate the existing ones (from a testing perspective).

If we decide (we won't) to deduplicate such things, it is just bound to blow up in our faces: we would never know that some of them don't work, should the manufacturers decide to change something again.

Alright. I'll look into some other means of reducing my local set to the minimum that still provides maximum coverage (as already discussed on GitHub a while back). Well, iff I find the time and it turns out to be worth the effort.

But in the Nikon example you would then end up with the cross product of bit depth x compression x crop, which would be 2 x 3 x 4 = 24 files for each camera. A few extra files here and there are one thing, but blowing it up like this might be serious enough to think about what is really helpful and what is not. For example, only saving enough samples so that each property is covered at least once, rather than the full set of combinations. Maybe the combination of bit depth vs. compression is helpful (I have not looked at the code). Just saying.
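To illustrate the "each property covered at least once" idea: a greedy selection along these lines already cuts the D800 example down a lot. The property names and values below are made up for the example, not actual r.p.u metadata:

```python
# Greedy sketch of "cover every property value at least once" instead of
# keeping the full bit depth x compression x crop cross product.
from typing import Dict, List

Sample = Dict[str, str]

def minimal_cover(samples: List[Sample], properties: List[str]) -> List[Sample]:
    needed = {(p, s[p]) for s in samples for p in properties}
    kept: List[Sample] = []
    while needed:
        # Take the sample covering the most still-uncovered property values.
        best = max(samples, key=lambda s: len({(p, s[p]) for p in properties} & needed))
        kept.append(best)
        needed -= {(p, best[p]) for p in properties}
    return kept

d800 = [
    {"bits": b, "compression": c, "crop": r}
    for b in ("12", "14")
    for c in ("uncompressed", "lossless", "lossy")
    for r in ("FX", "1.2x", "DX", "5:4")
]
print(len(d800), "->", len(minimal_cover(d800, ["bits", "compression", "crop"])))
# prints: 24 -> 4
```

This keeps 4 of the 24 combinations while every bit depth, compression mode and crop still appears at least once; whether that is enough to hit all the interesting code paths is exactly the open question above.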