Git annex use in workflow

Since there is a lot of very good information about using git-annex in my workflow topic, I’m creating a new topic here to discuss it. That should make the information easier for everybody to find.

Thanks to @paperdigits’ workflow and explanations, we have a rather complete workflow. I’m quoting parts of the other topic; these are not my posts but reposts of @paperdigits’ answers to my questions and @luxagraf’s.

Workflow:

Why use git-annex and not rsync


@paperdigits: if you feel up to it, would you mind writing a short post on how to set this up?

That way I can set it up, walk through all the steps to get it working with my very large database, and turn it into a tutorial for those who are interested.

It seems to me that it’s important to explain why, from your perspective, this system is better than Jason DeRose’s Dmedia. I don’t use either, but from what’s been posted so far it strikes me that they’re both trying to solve similar problems.

I don’t know whether git-annex is better; probably neither is better, just different. But git-annex

  • has a website (didn’t know about dmedia before, seems to exist in launchpad only)
  • integrates with git very well (think of xmp versioning)
  • is reasonably stable and mature (from the dmedia website: “NOTE THAT DMEDIA IS NOT YET PRODUCTION READY!
    TEST DMEDIA, BUT PLEASE DON’T YET TRUST YOUR DATA TO IT!”)

But anyway, I have to check dmedia now ;-).

So I’ve been playing with git annex a bit, and one thing that seems to make it an ideal solution for RAW files in particular is that it locks files, which means your RAW files are read-only. That also means you wouldn’t want to annex your .xmp files, since they’re not very big and are frequently rewritten by editing software.

As @paperdigits mentions above you could always use regular git for any sidecar files.

One thing I’m not super crazy about is that it moves all your actual files into .git/annex and then creates symlinks where the actual files used to be. I get why it needs to do that, but it does make me a tad nervous.

You should also be able to tell git-annex to ignore xmp/pp3 and other sidecar files, so that only git can see them!

I will certainly write up a post about my use, but it might take me a day or two.


That would be great. When I set it up, I will try to have git track the XMP files.

Setting up git-annex

  1. Create the repository and initialize it.
    ~$ cd ~/
    ~$ mkdir Photographs
    ~$ cd Photographs
    ~$ git init
    ~$ git-annex init "My beautiful photographs"
  2. Create a backup repository and initialize it.
    ~$ cd /media/external/
    ~$ mkdir Photographs-backup
    ~$ cd Photographs-backup
    ~$ git init
    ~$ git-annex init "A backup of my beautiful photographs"
  3. Let the repositories know about one another:
  • From ~/Photographs:
    ~$ git remote add backup /media/external/Photographs-backup
  • From /media/external/Photographs-backup:
    ~$ git remote add homedir ~/Photographs
  4. Tell git-annex to keep multiple copies of your files.
    ~$ echo "*.nef annex.numcopies=2" >> .gitattributes
  5. Tell git-annex to ignore XMP files so you can check them into git (requires git-annex version 6.20160126 or later).
    ~$ echo "*.xmp annex.largefiles=nothing" >> .gitattributes
  6. Add some content.
    ~$ rsync -rvP /media/camera/DCIM/ ~/Photographs/
    ~$ cd ~/Photographs/
    ~$ git-annex add . # note it isn't git add .
    ~$ git commit -m "Added photos from today's walk."
  7. Sync the content between the repositories.
    ~$ cd ~/Photographs
    ~$ git-annex sync --content
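
If you want to double-check which of the two .gitattributes rules applies to which file before adding anything, plain git can report it. This is only a minimal sketch: it assumes git is installed, works in a throwaway directory, and the file names IMG_0001.nef and IMG_0001.nef.xmp are made up (the files don't even need to exist).

```shell
# Scratch repository so the real photo archive is not touched.
demo=$(mktemp -d)
cd "$demo" || exit 1
git init -q .

# The two rules from the setup steps.
echo "*.nef annex.numcopies=2" >> .gitattributes
echo "*.xmp annex.largefiles=nothing" >> .gitattributes

# git check-attr reports the value each path would get.
nef_attr=$(git check-attr annex.numcopies -- IMG_0001.nef)
xmp_attr=$(git check-attr annex.largefiles -- IMG_0001.nef.xmp)
echo "$nef_attr"
echo "$xmp_attr"
```

A path that matches neither pattern is reported as "unspecified". Note that the patterns are matched as written, so it may be worth checking a file name with the exact extension your camera produces.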


Thanks a lot for the explanation.

Going to walk through the tutorial in a little bit, to set this up.

I finally managed to set up a test repository for managing my photographs with git-annex, and I still have a couple of questions (there is lots of documentation on https://git-annex.branchable.com, but I find it hard to find answers to my specific questions there). Since you are already using the tool, maybe you or somebody else could give some insight.

I am assuming a setup with 2 repositories, a notebook computer with limited space and an external drive with infinite space. The latter should hold all data (and is therefore in the “backup” group), while the former should only keep important data and the files that are actually worked on. numcopies is globally set to “1”.

  1. When I run git annex sync --content, git-annex tends to move previously dropped files back onto the notebook. How can I tell git-annex not to do so, without having to git annex move every file manually and give up the automatic sync?
  2. How can I individually override the automatic decision for single files directly (not setting numcopies)? I want to tell it to keep some files on the notebook regardless of any other settings (e.g. the 5 star pictures).
  1. Use git annex sync without --content. This will sync the list of files or the “state” of the repo, but not the content.

  2. It’s pretty early here, so I reserve the right to edit this answer :wink: You might be able to read the file metadata into git annex so you can use a view to keep those files.

A few things: if you only have the two repos, you probably never want to use sync --content. That will just put all the files in both repos. In the backup repo use git annex get . in the root of the directory to get all the files from the laptop. On the laptop, use git annex get <somefile> to get specific files.
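
Put together, the routine described above might look like the sketch below. The function names are made up for illustration; only the git annex sync and git annex get calls come from the post, and nothing runs until you call one of the functions.

```shell
# Run in either repository: exchange the git state (file list,
# location tracking) with the remotes, without file content.
annex_sync_state() {
    git annex sync
}

# Run in the root of the backup repository: fetch the content of
# every file that is not yet present there.
annex_fill_backup() {
    git annex get .
}

# Run on the laptop: fetch only the files named on the command line.
annex_fetch() {
    git annex get "$@"
}
```

For example, annex_fetch 2016/IMG_0001.nef on the laptop would pull in just that one (hypothetical) file.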

I’ll update this post with some workflows for you when I’m fully awake. :slight_smile:


Thanks for the tip to use get instead of sync --content. I was hoping there would be a way to configure the repository so that, by default, it does not pull back dropped content unless explicitly asked to. Do you know of such an option? It would be much simpler not to have to be that careful and never run get . on the wrong side.

Hm, I found a way to make annex drop files that are in the other repository, which is step 1. For that, I set git annex wanted . "not (copies=1)". The next step would be to define the exceptions, i.e., the files I definitely want on both sides. I thought it would be enough to run echo "file annex.numcopies=2" >> .gitattributes, but it does not help: file is not synced back after dropping. Any idea? Or is this too far off topic, and should I rely entirely on the git-annex forum? (I asked a different question there, but for the special needs of photographers I would prefer to get as far as we can here, so that things are documented for others.)

Hey @chris, sorry, your initial post caught me at an awkward time, I had to travel this week for work and it was a working marathon.

I think defining some specific use cases would be really helpful here.

Say you have two repos: laptop and backup. laptop is your laptop with limited space; backup is the external disk that is large enough to hold your whole collection of photos for a while. I’ll also assume that the available space on laptop is less than the total size of your photo collection, so doing git annex sync --content isn’t feasible, since your disk will run out of space.

It seems like you have both repos set up correctly and they can “talk” to one another, e.g., when you issue a git annex command, the repos can sync the git state and transfer files (if this assumption is not correct, we’ll back track a little and get it configured correctly).

Workflow scenarios:

  • Sync the git state of both repos without transferring content
  • Get an arbitrary file onto laptop that isn’t there
  • Remove an arbitrary file from laptop
  • Keep 5-star images available on laptop
  • Import a day’s worth of shooting to laptop
  • Import a day’s worth of shooting to backup

Please let me know if there are any more workflows you’d like to add.


Hey @paperdigits, thanks for your support on my weird plans and fantasies on how things should work :smile:.

I still struggle with some basic concepts, e.g. which rule takes precedence, “numcopies” or the rule in “wanted”, or if I have to make “wanted” aware of the numcopies requirement. I know that your approach on my issue would be to avoid sync --content, but I still think there is a way to configure the repository in a way such that things work as expected automatically and that I cannot accidentally break it (e.g. by running sync --content).

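Regarding workflow scenarios:

Sync the git state of both repos without transferring content: that’s the easy one, git annex sync, and it works as expected.

Get an arbitrary file onto laptop: git annex get <file>

Remove an arbitrary file from laptop: git annex drop <file/pathspec> or, if configured correctly, git annex drop -a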

Keep 5-star images available on laptop: that’s a tricky one. After some discussion in the git-annex forum, I think the best approach would be to sync the metadata (only the star rating, or maybe other workflow-related metadata) from the sidecars to the git-annex metadata (always in one direction, from sidecar to git-annex) and have preferred content rules reflect my personal ideas about which files to keep. I am working on a little shell script to parse the sidecars and set the metadata accordingly. But this will take a while; I haven’t had much time recently …
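
To make the sidecar idea a bit more concrete, here is a rough sketch of what such a script could look like. It is not the script mentioned above. It assumes the rating is stored in the attribute form xmp:Rating="N", that sidecars are named <image>.xmp (e.g. IMG_0001.CR2.xmp), and that "rating" is the git-annex metadata field you want to use; the function names are made up.

```shell
# Print the star rating stored in an XMP sidecar, or nothing if the
# file has none. Only the attribute form xmp:Rating="N" is handled.
xmp_rating() {
    sed -n 's/.*xmp:Rating="\(-\{0,1\}[0-9][0-9]*\)".*/\1/p' "$1" | head -n 1
}

# Copy ratings one way, from sidecar to git-annex metadata.
# Assumes sidecars are named <image>.xmp and that the metadata
# field is called "rating" (both are assumptions).
sync_ratings() {
    find "${1:-.}" -name .git -prune -o -name '*.xmp' -print |
    while IFS= read -r sidecar; do
        image=${sidecar%.xmp}
        # -L also accepts annexed files whose content was dropped.
        if [ -e "$image" ] || [ -L "$image" ]; then
            rating=$(xmp_rating "$sidecar")
            if [ -n "$rating" ]; then
                git annex metadata --set "rating=$rating" "$image" </dev/null
            fi
        fi
    done
}
```

Calling sync_ratings . in the repository root would walk all sidecars. A preferred content expression such as metadata=rating=5 could then refer to the field, though how that interacts with numcopies is something I would test on a scratch repository first.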

Import a day’s worth of shooting: apart from adding the files to git-annex, I would skip more complex setups such as configuring memory cards as git-annex repositories in the “source” group, since the directory structure on camera memory cards is somewhat restricted, and there are great tools that do this without git-annex (e.g. Rapid Photo Downloader).

Besides these, there are a couple more workflows to add, but I am running out of time (family wants breakfast) … An important one would be to drop the JPEG duplicates of raw files when shooting raw+JPEG. I had some discussion about this over at the git-annex forum, see here, and it does not seem to be (easily) possible automatically from the git-annex side, so I am working on a shell script to set git-annex metadata accordingly, something like

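# Look for a JPG next to every CR2; quoting keeps paths with spaces intact.
find . -name "*.CR2" | while IFS= read -r raw; do
  jpg="${raw%.CR2}.JPG"
  if [ -e "$jpg" ]; then
      # <set metadata> for "$jpg" goes here
      :
  fi
done

but a bit more robust and with the correct metadata setting.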
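
As one possible direction for "more robust", here is a sketch that also recognises annexed JPEGs whose content has already been dropped (they are dangling symlinks, so a plain -e test misses them) and that tags the duplicates. The function names and the tag name "rawduplicate" are made-up placeholders, not something agreed on in the git-annex forum discussion.

```shell
# Print every JPG that sits next to a CR2 with the same base name.
list_raw_jpeg_duplicates() {
    find "${1:-.}" -name .git -prune -o -name '*.CR2' -print |
    while IFS= read -r raw; do
        jpg=${raw%.CR2}.JPG
        # -L also matches annexed files whose content is not present.
        if [ -e "$jpg" ] || [ -L "$jpg" ]; then
            printf '%s\n' "$jpg"
        fi
    done
}

# Tag each duplicate so that later rules can refer to it.
# The tag name "rawduplicate" is a made-up placeholder.
tag_raw_jpeg_duplicates() {
    list_raw_jpeg_duplicates "$1" | while IFS= read -r jpg; do
        git annex metadata --tag rawduplicate "$jpg" </dev/null
    done
}
```

Running list_raw_jpeg_duplicates . first gives a list to look over before anything is tagged.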

Hm, and, of course, the other way round as well: drop rejected images and those with n stars or fewer. I would start with n=1 as long as disk space permits …

Sounds like you’ve got a solid idea of what you are doing. I didn’t mean for you to explain everything, I was just trying to clarify what the end goal is.

You should check out wanted = present, as it keeps files that are already on the laptop there, which lets you run git annex sync --content. It may keep you from using other preferred content settings, though; that wasn’t super clear to me, and I only read the man page twice.
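
For reference, setting and checking that could look roughly like this. The function names are made up; the sketch assumes you run it inside the laptop repository, where "." stands for the current repository.

```shell
# Keep whatever is already present in this repository and do not
# ask for anything else. "." refers to the current repository.
set_wanted_present() {
    git annex wanted . present
}

# Print the preferred content expression currently in effect.
show_wanted() {
    git annex wanted .
}
```

After set_wanted_present, a sync with --content should leave the laptop’s set of files as it is, so fetching and dropping stay manual.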

There was no complaint in my response; I am happy about every little hint that helps me better understand how things work. Over the next few days I’ll try to better document what my setup should eventually look like.

Hm, similar on my side. I read the documentation many times over the last two weeks, but I still struggle with some details. Unfortunately, only some standard use cases are documented really well; if you want to dive deeper, there’s a lot to discover by trial and error.

I think that’s the point: I have not understood the preferred content settings well enough to achieve my goal. And the git annex find command is still a mystery to me; I am still not able to mimic a --dry-run behaviour with it.
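
One thing that might get close to a dry run is combining git annex find with the --want-get and --want-drop matching options. This is a sketch only: the function names are made up, and I have not verified that the output matches what sync --content would do in every case (numcopies, for instance, can still prevent a drop).

```shell
# Files that the preferred content settings want here but that are
# not present yet, i.e. candidates for being fetched.
preview_get() {
    git annex find --want-get --not --in=here "$@"
}

# Files that are present here but that the preferred content
# settings would allow dropping.
preview_drop() {
    git annex find --want-drop --in=here "$@"
}
```

Both take optional paths, e.g. preview_drop 2016 to look at a single (hypothetical) directory.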

When I started to look into git annex, its packages had disappeared from the distributions I’m using. When I asked for the reason, the answer was to use git-lfs, and it looks like that will make git annex obsolete.

Hm, for some applications they are similar, but git-annex can do so much more. I hope it will find its way back into distros. E.g., what I am working on here is impossible with lfs.