Recent comments posted to this site:

are there plans to have chunks stored in the regular backend storage?

i'm curious because one of my use cases is archiving websites, where we end up with lots of WARC files. those files are basically a bunch of files from the website gzipped together in a stream, which means that multiple crawls of the same website (or actually, of different websites) have lots of redundant data (e.g. jQuery.js). storing those files in git-annex is not very efficient, because that data is duplicated all over the place.

if the storage backend was chunked, there could be massive deduplication across those files... this is why i looked at the borg special remote: I figured that i could at least deduplicate on the remote side, but it would still be nice to have this built-in! -- anarcat
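
To make the deduplication idea concrete, here is a minimal sketch of content-defined chunking, roughly the approach tools like bup and borg take. It is not how git-annex stores anything: `split_chunks`, `add_file`, `restore`, the window size, and the boundary mask are all made up for illustration, and the two "crawls" are random bytes wrapped around a shared blob standing in for something like jQuery.js. It works on raw bytes; whether separately compressed WARC records would line up this way is not something the sketch addresses.

```python
import hashlib
import random
from typing import Dict, List

WINDOW = 16   # bytes of trailing context that decide where a chunk ends
MASK = 0xFF   # cut when the low 8 bits of the window hash are zero (~256-byte chunks)


def split_chunks(data: bytes) -> List[bytes]:
    """Split data at boundaries chosen from the content itself."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data)):
        window_hash = hashlib.blake2b(data[end - WINDOW:end], digest_size=4).digest()
        if int.from_bytes(window_hash, "big") & MASK == 0:
            chunks.append(data[start:end])
            start = end
    chunks.append(data[start:])  # whatever is left after the last boundary
    return chunks


def add_file(store: Dict[str, bytes], data: bytes) -> List[str]:
    """Store each chunk once, keyed by its SHA-256; return the list of keys."""
    recipe = []
    for chunk in split_chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # an already-known chunk is not stored again
        recipe.append(digest)
    return recipe


def restore(store: Dict[str, bytes], recipe: List[str]) -> bytes:
    """Rebuild a file from its list of chunk keys."""
    return b"".join(store[digest] for digest in recipe)


def random_bytes(rng: random.Random, count: int) -> bytes:
    return bytes(rng.getrandbits(8) for _ in range(count))


# Two hypothetical crawls that embed the same asset at different offsets.
rng = random.Random(0)
shared = random_bytes(rng, 8000)
crawl_a = random_bytes(rng, 300) + shared + random_bytes(rng, 200)
crawl_b = random_bytes(rng, 517) + shared + random_bytes(rng, 90)

store: Dict[str, bytes] = {}
recipe_a = add_file(store, crawl_a)
recipe_b = add_file(store, crawl_b)
stored = sum(len(chunk) for chunk in store.values())
print(f"raw: {len(crawl_a) + len(crawl_b)} bytes, stored: {stored} bytes")
```

Because the cut points depend only on the last few bytes seen, the shared blob gets split the same way in both crawls even though it sits at a different offset in each, so most of its chunks are stored only once.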

Comment by anarcat Mon Dec 3 17:30:00 2018

The bup special remote does exist, so if you want to use that special remote, you can get efficient storage and transfer of related versions. It would probably be possible to make bup use the same git repo as git-annex, just storing its data in a separate branch, but I have not tried it.

Comment by joey Mon Dec 3 17:30:00 2018

I think that annex always uses cp --reflink=auto for local paths (cache remote was on a local path right?). I guess running with --debug could have helped to resolve the mystery ;-)

BTW -- checked locally - reflink=auto seems to work nicely across subvolumes of the same BTRFS filesystem. "copying" gigabytes takes half a second or so ;-) (without reflink=auto - takes considerably longer)
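
For anyone who wants to repeat that check between two specific directories, a rough sketch follows. It assumes GNU `cp` (the one that accepts `--reflink`); the helper name `reflink_supported` and the test file name are made up here. It uses `--reflink=always` on purpose, since that mode fails instead of silently falling back to a full copy.

```python
import os
import subprocess
import tempfile


def reflink_supported(src_dir: str, dst_dir: str) -> bool:
    """Return True if `cp --reflink=always` succeeds from src_dir to dst_dir."""
    fd, src = tempfile.mkstemp(dir=src_dir)
    dst = os.path.join(dst_dir, os.path.basename(src) + ".reflink-test")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(b"x" * 4096)
        try:
            result = subprocess.run(
                ["cp", "--reflink=always", src, dst],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except FileNotFoundError:
            return False  # no cp binary on this system
        return result.returncode == 0
    finally:
        # Remove the probe files whether or not the copy worked.
        for path in (src, dst):
            if os.path.exists(path):
                os.remove(path)
```

Calling it with a directory in each subvolume, for example `reflink_supported("/path/in/subvol-a", "/path/in/subvol-b")` (placeholder paths), tells you whether a CoW copy is possible between them at all.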

Comment by yarikoptic Fri Nov 23 18:12:39 2018

Well, I did not check reflink=auto. I just checked first with git annex copy --to=cache, which simply duplicated every file. So there are two possibilities:

  1. reflink=auto doesn't work (I did not check, but I think it works)
  2. git-annex does not recognise the other path as being on the same filesystem, so it fell back to a plain copy (very likely, but joey has to check.)

Comment by Mowgli Wed Nov 21 19:45:30 2018

It doesn't work well if the source of the copy is in a btrfs subvolume and the cache is in another subvolume of the same filesystem.

With that setup every file is really copied instead of using reflink=always.

For now I solved it by manually copying .git/annex/objects into the cache (cp -a --reflink=always .git/annex/objects ~/.cache/annex/) and afterwards doing the git annex copy, which recognises that the objects already exist.

Comment by Mowgli Wed Nov 21 19:28:37 2018

Hi Mowgli, could you please elaborate for a slow me -- are you saying that --reflink=auto is not causing CoW between different subvolumes of the same filesystem while --reflink=always does?

P.S. glad to see more of BTRFS & git-annex tandem users around ;-)

Comment by yarikoptic Wed Nov 21 19:28:37 2018

The scenario that isStableKey is being used to guard against is two repos downloading the content of an url and each getting different content, followed by one repo uploading some chunks of its content and then the other repo "finishing" the upload with chunks of its different content. That would result in a mishmash of chunks being stored in the remote.

It's true that it could also happen using WORM with an url attached to it. (Not with other types of keys that verify a checksum.) Though it seems much less likely, since the file size is at least checked for WORM, while with URL keys there's often no recorded file size. And, WORMs don't typically have urls attached (I can't think of a single time I've ever done that, it just feels like asking for trouble), while URL keys always do.
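
The chunk mix-up scenario described above can be played through with a toy model. This is only a sketch under made-up assumptions: the "remote" is a dictionary, the chunk size is 4 bytes, the key name is hypothetical, and `upload`/`download` are illustrative helpers, not git-annex functions. The one behaviour borrowed from the description is that an upload resumes by skipping chunks that are already present.

```python
from typing import Dict, List, Optional, Tuple

CHUNK_SIZE = 4  # tiny on purpose so the result is easy to read

# Toy remote: (key, chunk index) -> chunk bytes
Remote = Dict[Tuple[str, int], bytes]


def chunks_of(content: bytes) -> List[bytes]:
    return [content[i:i + CHUNK_SIZE] for i in range(0, len(content), CHUNK_SIZE)]


def upload(remote: Remote, key: str, content: bytes,
           stop_after: Optional[int] = None) -> None:
    """Upload missing chunks; optionally stop early to simulate an interruption."""
    uploaded = 0
    for index, chunk in enumerate(chunks_of(content)):
        if (key, index) in remote:
            continue  # resume: trust chunks that are already stored
        if stop_after is not None and uploaded >= stop_after:
            break
        remote[(key, index)] = chunk
        uploaded += 1


def download(remote: Remote, key: str) -> bytes:
    """Concatenate chunks 0, 1, 2, ... until one is missing."""
    parts = []
    index = 0
    while (key, index) in remote:
        parts.append(remote[(key, index)])
        index += 1
    return b"".join(parts)


key = "URL--http://example.com/file"   # hypothetical key; nothing verifies its content
content_a = b"AAAAaaaaAAAAaaaa"        # what repo A got from the url
content_b = b"BBBBbbbbBBBBbbbb"        # what repo B got from the same url

remote: Remote = {}
upload(remote, key, content_a, stop_after=2)  # repo A is interrupted after two chunks
upload(remote, key, content_b)                # repo B "finishes" the upload
mixed = download(remote, key)
print(mixed)
```

The stored object starts with repo A's chunks and ends with repo B's, so it matches neither download. In this toy both contents have the same length, so a size check alone would not notice; a key that verifies a checksum would.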

If this is a serious concern, I'd suggest you open a todo or bug report about it; there are far too many comments to wade through here already. We could think about perhaps not allowing download of WORM keys from urls, or something like that.

Comment by joey Tue Nov 13 04:11:47 2018

As usual, create a directory on the server, git init, then git-annex init there.

Add that locally: git remote add my-server-name my-server:~/my-repo

git-annex sync locally seems to work fine and pushes data to the server.

I needed to have this workaround before, because I could not get data from my laptop while on the server (I wasn't sure I had an open IP address for my laptop). This is mostly a basic thing in git, but I had errors with git-annex earlier and I try to be cautious now.

Comment by metst13 Tue Nov 6 20:00:13 2018

I must add that after the previous commands finished, the content of the repository was not shown on the server (it was all in .git). I ran

git checkout synced/master

and files appeared. Unfortunately it seems all timestamps were new, which I didn't like, and it was already asked here (http://git-annex.branchable.com/todo/does_not_preserve_timestamps/).

Comment by metst13 Tue Nov 6 20:00:13 2018