Recent comments posted to this site:

Thank you! In my case, since the safe commit is too far in the past, what I had in mind is a little different: I wanted a completely disconnected history with a commit which had $privatefiles moved to annex, but I think the approach is "the same" in effect. The only thing I would do differently is to first convert files to git-annex in master (to become unredacted-master), so that I end up with the same tree in unredacted-master and master, in case it happens that I later need to cherry-pick some changes accidentally committed (e.g. by collaborators) on top of unredacted-master.

Another note worthwhile making IMHO is that AFAIK those git replace markers are local only, and whoever has unredacted-master later on might need to set them up as well for their local clones to make such a "collapse" of histories.

Comment by yarikoptic Fri Mar 1 19:44:08 2024

Is it possible to add git-lfs capabilities to a git-annex, without using a special remote?

I guess what I want is, are there any reasonable instructions to graft the hooks so that this is possible:

$ git init
$ git-lfs install
$ git-annex init

And you can alternate between something like below:

$ git-lfs track "*.exif_thumbnail.*"
$ git-annex add IMG_0001.jpg
$ git add IMG_0001.exif_thumbnail.jpg

Obviously this betrays the scenario of extracting thumbnails from the EXIF header and storing them alongside, as another form of metadata. If there's a better workflow to this, that would be appreciated too.

Comment by beryllium Fri Mar 1 19:44:08 2024

Sounds like you might want to use datalad, which is built around git annex and where submodules are a first-class citizen.

Datalad handles submodules as subdatasets and adds Python code layers on top to handle datasets (e.g. deduplicating submodules). But it doesn't detect that a submodule's path has changed like git.

So, sadly, it doesn't meet my needs.

Comment by TTTTAAAx Fri Mar 1 19:44:08 2024

@TTTTAAAx kindly posted a full example of their problem, which I've moved to detect and handle submodules after path changed by mv.

I do think that using git mv to rename directories that contain submodules is the right way to avoid that kind of problem. Note that renaming such a directory without using git followed by running git add on the new directory has the same behavior as running git-annex assist does. This is not a git-annex problem, but I think it could be considered a git problem; git could make git add of a moved submodule do the right thing.
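The git mv behaviour described above can be tried in a throwaway repository. This is only a sketch, assuming a reasonably recent git; the names sub, super, olddir and newdir are made up for the demo, and protocol.file.allow is set only because newer git versions refuse local-path submodule clones by default.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"

# A tiny repository to serve as the submodule source.
git init -q sub
git -C sub -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "sub init"

# The superproject, with the submodule placed inside a plain directory.
git init -q super
cd super
git config user.name demo
git config user.email demo@example.com
git -c protocol.file.allow=always submodule --quiet add "$demo/sub" olddir/sub
git commit -q -m "add submodule under olddir"

# Rename the containing directory with git mv, not plain mv.
git mv olddir newdir

# The submodule keeps its name (olddir/sub), but its recorded path follows the move.
moved_path=$(git config -f .gitmodules --get submodule.olddir/sub.path)
echo "path recorded in .gitmodules: $moved_path"
```

Doing the same rename with plain mv followed by git add is the case that goes wrong, as discussed above.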

Comment by joey Fri Mar 1 19:44:08 2024

Another note worthwhile making IMHO is that AFAIK those git replace markers are local only, and whoever has unredacted-master later on might need to set them up as well for their local clones to make such a "collapse" of histories

Right, any repository you fetch unredacted-master into, you will also want to fetch refs/redacted/ into as well, and run git replace there, as shown in the last code block of the tip above.
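As a side illustration of why the markers stay local: git records replacements as refs under refs/replace/, and a plain clone or fetch does not transfer those. The sketch below is not the tip's procedure (which fetches refs/redacted/ and re-runs git replace); it only shows, in a throwaway setup with made-up names (origin-repo, clone-repo), that the replace refs are absent after cloning and arrive once fetched explicitly.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"

git init -q origin-repo
cd origin-repo
git config user.name demo
git config user.email demo@example.com
echo one > file; git add file; git commit -q -m "first"
echo two > file; git commit -q -am "second"

# Create a replacement in the origin repository only
# (which commits are involved does not matter for this demo).
git replace "$(git rev-parse HEAD)" "$(git rev-parse HEAD~1)"
cd ..

git clone -q origin-repo clone-repo
cd clone-repo

# A fresh clone has no replace refs ...
before=$(git replace -l | wc -l | tr -d ' ')

# ... until they are fetched with an explicit refspec.
git fetch -q origin 'refs/replace/*:refs/replace/*'
after=$(git replace -l | wc -l | tr -d ' ')

echo "replace refs before fetch: $before, after fetch: $after"
```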

Comment by joey Fri Mar 1 19:44:08 2024

Here is a script I crafted to make this easy; it reuses the current tree object for the new "squashed history" commit:

#!/bin/bash
#
# A helper to establish an alternative history to hide commits which could have
# leaked personal data etc.
#
# More information on motivation etc and another implementation could be
# found at https://git-annex.branchable.com/tips/redacting_history_by_converting_git_files_to_annexed/
#

set -eu

BRANCH=$(git rev-parse --abbrev-ref HEAD)
: "${SECRET_BRANCH:=unredacted-$BRANCH}"
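# the last commit that is safe to keep in the public history (required argument)
SAFE_BASE="${1:?usage: $0 SAFE_BASE}"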

git branch "${SECRET_BRANCH}"

rm -f .git/COMBINED_COMMIT_MESSAGE
echo -e "Combined commits to hide away sensitive data\n" >> .git/COMBINED_COMMIT_MESSAGE
git log --format=%B "$SAFE_BASE..HEAD" >> .git/COMBINED_COMMIT_MESSAGE

# the tree we are on ATM
TREE_HASH=$(git log -1 --format=%T HEAD)
NEW_COMMIT=$(git commit-tree "$TREE_HASH" -p "$SAFE_BASE" -F .git/COMBINED_COMMIT_MESSAGE)
rm -f .git/COMBINED_COMMIT_MESSAGE
git reset --hard "$NEW_COMMIT"

git replace "$BRANCH" "$SECRET_BRANCH"
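The core step of the script above, creating one new commit that reuses the current tree but has the safe base as its parent, can be checked in isolation. A minimal sketch in a throwaway repository follows; the repository, file name and commit messages are made up for the demo.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"
git init -q repo
cd repo
git config user.name demo
git config user.email demo@example.com

# Three commits; pretend only the first one is safe to publish.
for n in 1 2 3; do
    echo "$n" > file
    git add file
    git commit -q -m "commit $n"
done

safe_base=$(git rev-parse HEAD~2)
old_tip=$(git rev-parse HEAD)
tree=$(git rev-parse 'HEAD^{tree}')

# New commit: same tree as the current tip, but parented directly on the safe base.
new_commit=$(git commit-tree "$tree" -p "$safe_base" -m "Combined commits")

new_tree=$(git rev-parse "$new_commit^{tree}")
new_parent=$(git rev-parse "$new_commit^")
echo "old tip:    $old_tip"
echo "new commit: $new_commit (tree $new_tree, parent $new_parent)"
```

The new commit has the same content as the old tip but skips the intermediate commits, which is what the script then resets the branch to.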

Comment by yarikoptic Fri Mar 1 19:44:08 2024

Is it possible to somehow make git annex whereis show the response of the special remote to WHEREIS over multiple lines? Just including newlines obviously results in an error, since that ends the WHEREIS-SUCCESS message.

I am implementing a special remote for which the data is fully described by what is essentially a json-encoded request to a third-party API, and I would like to show this json string pretty-printed over multiple lines in the whereis output, instead of as a single line.

Comment by matrss Fri Mar 1 19:44:08 2024

@craig, all of git-annex's information about a special remote is stored in the git-annex branch in git, so any clone of the git repository is sufficient to back that up. You can run git annex enableremote in a clone to enable an existing special remote.

The only catch is that, if you have chosen to initremote a special remote using a gpg key, with keyid=whatever, you'll of course also need that gpg key to use it. If you run git annex info $myremote it will tell you, among other things, any gpg keys that are used by that remote.
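The recovery steps described above could look roughly like this. It is a sketch only: restore_special_remote is a hypothetical helper, and the repository URL, the restored-clone directory and the remote name are placeholders to be supplied by the caller.

```shell
# Hypothetical helper: $1 is the URL of any clone of the git repository,
# $2 is the name of the special remote that was originally set up with initremote.
restore_special_remote() (
    repo_url=$1
    remote_name=$2

    git clone "$repo_url" restored-clone
    cd restored-clone

    # Without a name, enableremote reports the special remotes it knows about
    # from the git-annex branch (it may exit non-zero while doing so).
    git annex enableremote || true

    # Enable the existing special remote in this clone.
    git annex enableremote "$remote_name"

    # Shows details of the remote, including any gpg keys it uses.
    git annex info "$remote_name"
)
```

It would be called as, for example, restore_special_remote ssh://some.server/repo.git myremote, with both arguments being illustrative.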

Comment by joey Fri Feb 9 13:49:29 2024

Would it be possible to provide a combined backend of worm + partial hash? I'd imagine that this would make the backend faster than pure hashing, while also lowering the probability of erroneously identifying two different, but worm-equivalent, files as the same.

Comment by windfish Fri Feb 9 13:49:29 2024

Hi, what would be a recommended setup and the working procedures for the following scenario:

  • using git-annex version: 8.20210223, which is the one in ubuntu-22.04 (can't upgrade easily)
  • a central server as a mutable central archive, many users (over ssh)
  • users are all trusted
  • the server shall keep all annexed files, but only the HEAD version is relevant, that is: if the file is removed by the user, it shall eventually be permanently removed from the central server too, to save space.
  • users would typically not need all the files, but only some, so git annex get files... would do
  • users would also add or remove annexed files (and push them to the central repository)
  • a user might remove his/her local repository at any time, so the central server shall not keep track of clones or at least shall not care if any or all clones get removed

I have created the central repository like this (please correct me):

git init test --bare
cd test
git annex init
git annex required . "include=*"

On the user side:

git clone ssh://some.server/repo/test test
cd test
dd if=/dev/random of=./bigfile bs=1M count=10
git annex add bigfile
# how to sync (push only)?
# how to permanently remove big file?
cd ..
# done with the task
chmod -R 777 test
rm -rf test

What I am looking for is the sequence of commands for the users, to:

  • sync to the latest state (without fetching the content)
  • add new annexed file to the repository and push it
  • permanently delete annexed file

There are several issues I am facing at the moment. I was expecting to push the new file with git annex sync --content --no-pull, but this command also pulls the contents of all annexed files, which I don't want. Also, the server does not remove the old content. It looks like I am doing something wrong. I would appreciate your suggestions about this scenario.
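One possible shape for the three user-side operations listed above is sketched here. It is untested against this exact setup and version, and is not a confirmed answer: the function names are hypothetical, "origin" is assumed to be the central server, and whether the required-content setting interferes with dropping on the server is left open.

```shell
# Sync git branches (including the git-annex branch) without moving file contents.
update_without_content() {
    git annex sync --no-content
}

# Add a new annexed file, upload only its content, then publish the git changes.
add_and_push() {
    git annex add "$1"
    git commit -q -m "add $1"
    git annex copy --to=origin "$1"
    git annex sync --no-content
}

# Remove a file from the tree; its content then becomes "unused" on the server.
remove_file() {
    git rm -q "$1"
    git commit -q -m "remove $1"
    git annex sync --no-content
}

# To be run inside the central (bare) repository, e.g. periodically:
# list content no longer referenced by any branch or tag, then drop it.
purge_unused_on_server() {
    git annex unused
    git annex dropunused all --force
}
```

The idea is to keep content transfer explicit (git annex copy) rather than letting sync --content decide what to move, and to treat permanent deletion as a separate server-side step.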

Comment by zoran.bosnjak Fri Feb 9 13:49:29 2024