For publishing content from a git-annex repository, it would be useful to be able to export a tree of files to a special remote, using the filenames and content from the tree.
(See also export and dumb, unsafe, human-readable backend)
configuring a special remote for tree export
If a special remote already has files stored in it, switching it to be a tree export would result in a mix of files named by key and by filename. That's not desirable. So, the user should set up a new special remote when they want to export a tree. (It would also be possible to drop all content from an existing special remote and reuse it, but there does not seem much benefit in doing so.)
Add a new `initremote` configuration, `exporttree=true`, that cannot be changed by `enableremote`:

    git annex initremote myexport type=... exporttree=true
It does not make sense to encrypt an export, so exporttree=true requires (and can even imply) encryption=false.
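For example, with the existing directory special remote the full setup might look like this (the directory path is made up; `encryption=none` is the usual way to disable encryption, matching the requirement above):

    git annex initremote myexport type=directory directory=/srv/export encryption=none exporttree=true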
Note that the particular tree to export is not specified yet. This is because the tree that is exported to a special remote may change.
exporting a treeish
To export a treeish, the user can run:
    git annex export $treeish --to myexport
That does all necessary uploads etc to make the special remote contain the tree of files. The treeish can be a tag, a branch, or a tree.
Users may sometimes want to export multiple treeishes to a single special remote. For example, exporting several tags. The interface could be made more complicated to support that, by putting the treeishes in subdirectories on the special remote, etc. But that's not necessary, because the user can use git commands to graft trees together into a larger tree, and export that larger tree.
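For instance, two tags could be grafted into one tree with ordinary git plumbing. This is only a sketch, with made-up tag names:

    # hypothetical: put each tag's tree into a subdirectory of one combined tree
    t1=$(git rev-parse "v1.0^{tree}")
    t2=$(git rev-parse "v2.0^{tree}")
    combined=$(printf '040000 tree %s\t%s\n040000 tree %s\t%s\n' "$t1" v1.0 "$t2" v2.0 | git mktree)
    git annex export "$combined" --to myexport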
If an export is interrupted, running it again should resume where it left off.
It would also be nice to have a way to say, "I want to export the master branch", and have git-annex sync and the assistant automatically update the export. This could be done by recording the treeish in eg, refs/remotes/myexport/HEAD. git-annex export could do this by default (if the user doesn't want the export to track the branch, they could instead export a tree or a tag).
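A minimal sketch of that recording, using the ref name suggested above; git-annex sync or the assistant could then notice when the branch moves and re-run the export:

    # hypothetical: remember that this export tracks the master branch
    git symbolic-ref refs/remotes/myexport/HEAD refs/heads/master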
updating an export
The user can at any time re-run git-annex export with a new treeish to change what's exported. While some use cases for git annex export involve publishing datasets that are intended to remain immutable, other use cases include eg, making a tree of files available to a computer that can't run git-annex, and in such use cases, the tree needs to be able to be updated.
To efficiently update an export, git-annex can diff the tree that was exported with the new tree. The naive approach is to upload new and modified files and remove deleted files.
Note that a file may have been partially uploaded to an export, and then the export updated to a tree without that file. So, git-annex needs to try to delete all removed files, even if location tracking does not say that the special remote contains them.
With rename detection, if the special remote supports moving files, more efficient updates can be done. It gets complicated; consider two files that swap names.
If the special remote supports copying files, that would also make some updates more efficient.
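The change set could come straight from git plumbing. This is only a sketch of the idea, with `$oldtree` and `$newtree` standing in for the previously exported and new treeishes:

    # hypothetical: compute the work needed to update the export
    git diff-tree -r -M --name-status "$oldtree" "$newtree"
    # A: upload, D: delete from the remote, M: re-upload,
    # R: rename on the remote if it supports it, otherwise delete + upload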
resuming exports
Resuming an interrupted export needs to work well.
There are two cases here:
- Some of the files in the tree have been uploaded; others have not.
- A file has been partially uploaded.
These two cases need to be disentangled somehow in order to handle them. One way is to use the location log as follows:
- Before a file is uploaded, look up what key is currently exported using that filename. If there is one, update the location log, saying it's not present in the special remote.
- Upload the file.
- Update the location log for the newly exported key.
Note that this method does not allow resuming a partial upload by appending to a file, because we don't know if the file actually started to be uploaded, or if the file instead still has the old key's content. Instead, the whole file needs to be re-uploaded.
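As an illustration of that ordering, the location log entries involved might look like this (uuid and timestamps are made up, and the comments are added here only for explanation; the real log has none):

    1500000000.000001s 0 e605dca6-446a-11e0-8b2a-002170d25c55   # old key at that filename marked not present before the upload starts
    1500000100.000001s 1 e605dca6-446a-11e0-8b2a-002170d25c55   # new key (in its own log file) marked present after the upload succeeds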
Alternative: Keep an index file that's the current state of the export. See comment #4 of export. Not sure if that works? Perhaps it would be overkill if it's only used to support resuming partial uploads.
changes to special remote interface
This needs some additional methods added to special remotes, and to the external special remote protocol.
- `TRANSFEREXPORT STORE|RETRIEVE Key File Name`
  Requests the transfer of a File on local disk to or from a given Name on the special remote.
  The Name will be in the form of a relative path, and may contain path separators, whitespace, and other special characters.
  The Key is provided in case the special remote wants to use eg `SETURIPRESENT`.
  The remote responds with either `TRANSFER-SUCCESS` or `TRANSFER-FAILURE`; a remote where exports do not make sense may always fail.
- `CHECKPRESENTEXPORT Key Name`
  Requests the remote to check if a Name is present in it.
  The remote responds with `CHECKPRESENT-SUCCESS`, `CHECKPRESENT-FAILURE`, or `CHECKPRESENT-UNKNOWN`.
- `REMOVEEXPORT Key Name`
  Requests the remote to remove content stored by `TRANSFEREXPORT`.
  The Key is provided in case the remote wants to use eg `SETURIMISSING`.
  The remote responds with either `REMOVE-SUCCESS` or `REMOVE-FAILURE`.
- `RENAMEEXPORT Key OldName NewName`
  Requests the remote to rename a file stored on it from OldName to NewName.
  The Key is provided in case the remote wants to use eg `SETURIMISSING` and `SETURIPRESENT`.
  The remote responds with `RENAMEEXPORT-SUCCESS`, `RENAMEEXPORT-FAILURE`, or `RENAMEEXPORT-UNSUPPORTED` if an efficient rename cannot be done.
To support old external special remote programs that have not been updated to support exports, git-annex will need to handle an `ERROR` response when using any of the above.
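To make the shape of the extension concrete, an exchange using the proposed messages might look like this (keys, paths, and the exact reply argument forms are illustrative; the replies are assumed to echo the key the way the existing TRANSFER and CHECKPRESENT replies do):

    TRANSFEREXPORT STORE SHA256E-s65536--0a1b2c data/2017/report.pdf.annextmp data/2017/report.pdf
    TRANSFER-SUCCESS STORE SHA256E-s65536--0a1b2c
    CHECKPRESENTEXPORT SHA256E-s65536--0a1b2c data/2017/report.pdf
    CHECKPRESENT-SUCCESS SHA256E-s65536--0a1b2c
    RENAMEEXPORT SHA256E-s65536--0a1b2c data/2017/report.pdf archive/2017/report.pdf
    RENAMEEXPORT-SUCCESS
    REMOVEEXPORT SHA256E-s65536--0a1b2c archive/2017/report.pdf
    REMOVE-SUCCESS SHA256E-s65536--0a1b2c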
location tracking
Does a copy of a file exported to a special remote count as a copy of a file as far as numcopies goes? Should git-annex get download a file from an export? Or should exporting not update location tracking?
The problem is that special remotes with exports are not key/value stores. The content of a file can change, and if multiple repositories can export a special remote, they can be out of sync about what files are exported to it.
To avoid such problems, when updating an exported file on a special remote, the key could be recorded there too. But, this would have to be done atomically, and checked atomically when downloading the file. Special remotes lack atomicity guarantees for file storage, let alone for file retrieval.
Possible solution: Make exporttree=true cause the special remote to be untrusted, and rely on annex.verify to catch cases where the content of a file on a special remote has changed. This would work well enough except for when the WORM or URL backend is used. So, prevent the user from exporting such keys. Also, force verification on for such special remotes, don't let it be turned off.
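The intended semantics correspond roughly to what can already be expressed manually today; shown here only as an illustration, since the proposal would enforce this automatically for exporttree=true remotes:

    # treat the export remote's content as untrusted and always verify downloads from it
    git annex untrust myexport
    git config remote.myexport.annex-verify true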
recording exported filenames in git-annex branch
In order to download the content of a key from a file exported
to a special remote, the filename that was exported needs to somehow
be recorded in the git-annex branch. How to do this? The filename could
be included in the location tracking log or a related log file, or
the exported tree could be grafted into the git-annex branch
(under eg, `exported/uuid/`). Which way uses less space in the git repository?
Grafting in the exported tree records the necessary data, but the file-to-key map needs to be reversed to support downloading from an export. It would be too expensive to traverse the tree each time to hunt for a key; instead would need a database that gets populated once by traversing the tree.
On the other hand, for updating what's exported, having access to the old exported tree seems perfect, because it and the new tree can be diffed to find what changes need to be made to the special remote.
If the filenames are stored in the location tracking log, the exported tree could be reconstructed, but it would take O(N) queries to git, where N is the total number of keys git-annex knows about; updating exports of small subsets of large repositories would be expensive. So grafting in the exported tree seems the better approach.
export conflicts
What if different repositories can access the same special remote, and different trees get exported to it concurrently?
This would be very hard to untangle, because it's hard to know what content was exported to a file last, and thus what content the file actually has. The location log's timestamps might give a hint, but clocks vary too much to trust it.
Also, if the exported tree is grafted in to the git-annex branch, there would be a merge conflict. Union merging would scramble the exported tree, so even if a smart merge is added, old versions of git-annex would corrupt the exported tree.
To avoid that problem, add a log file `exported/uuid.log` that lists the sha1 of the exported tree and the uuid of the repository that exported it.

To avoid the exported tree being GCed, do graft it in to the git-annex branch, but follow that with a commit that removes the tree again, and only update `refs/heads/git-annex` after making both commits.
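A sketch of that grafting dance with git plumbing, assuming `$uuid` and `$treeish` are the exporting remote's uuid and the exported treeish, and using the `exported/$uuid/` location proposed above (a real implementation would also need to replace any previous graft for that uuid):

    # work on a throwaway index so the real one is untouched
    export GIT_INDEX_FILE=$(mktemp -u)
    ga=$(git rev-parse refs/heads/git-annex)
    git read-tree "$ga"
    git read-tree --prefix="exported/$uuid/" "$treeish^{tree}"
    grafted=$(git write-tree)
    # the first commit grafts the exported tree in, keeping it reachable...
    c1=$(echo "export to $uuid" | git commit-tree "$grafted" -p "$ga")
    # ...the second removes it again, and only then is the branch ref moved
    c2=$(echo "remove export graft" | git commit-tree "$(git rev-parse "$ga^{tree}")" -p "$c1")
    git update-ref refs/heads/git-annex "$c2"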
If `exported/uuid.log` contains multiple active exports, there was an export conflict. Short of downloading the whole export to checksum it, or deleting the whole export, what can be done to resolve it?
In this case, git-annex knows both exported trees. Have the user provide a tree that resolves the conflict as they desire (it could be the same as one of the exported trees, or some merge of them). Then diff each exported tree in turn against the resolving tree. If a file differs, re-export that file. In some cases this will do unnecessary re-uploads, but it's reasonably efficient.
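A sketch of finding the files to re-export, assuming `$resolved` is the user's resolving tree and `$t1`/`$t2` are the conflicting exported trees:

    # any path that differs between an exported tree and the resolution
    # needs to be re-exported (possibly re-uploading more than strictly needed)
    { git diff-tree -r --name-only "$t1" "$resolved"
      git diff-tree -r --name-only "$t2" "$resolved"
    } | sort -u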
The documentation should suggest strongly only exporting to a given special remote from a single repository, or having some other rule that avoids export conflicts.
E.g. when exporting to an S3 bucket with versioning turned on, or to OSF (AFAIK). So upon successful upload the special remote could use SETURLPRESENT to signal availability of any particular key (associated with the file).

I have yet to grasp the cases you outlined well enough to see whether there is any other applicable use case.

I hope that export would be implemented by extending the external special remote protocol? ;)
just wondered... at least in my attempt at a zenodo special remote I did store zenodo's file deposition ID within the state, to be able to request it back later. An alternative, I guess, would be URL(s) -- could be something like exported:UUID/filename.
Or it could just be a mode of operation for a special remote, depending on whether "exporttree=true" is set: in the one (old) case it would operate based on the keys associated with the files pointed to on the cmdline (or just keys, for --auto or when pointed to by metadata), whereas with "exporttree=true" it would operate on the filenames pointed to on the command line (or the files found to be associated with the keys pointed to by --auto or by metadata)? Then the same 'copy --to' could be used in both cases, streamlining the user experience ;)
I've added a section with changes to the external special remote protocol. I included the Key in each of the new protocol commands, although it's not strictly needed, to allow the implementation to use SETURLPRESENT, SETSTATE, etc.
`git annex copy $file --to myexport` could perhaps work; the difficulty though is, what if you've exported branch foo, and then checked out bar, and so you told it to export one version of the file, and are running git-annex copy on a different version? It seems that git-annex would have to cross-check in this and similar commands, to detect such a situation. Unsure how much more work that would be, both CPU time and implementation time.

I do think that `git annex get` could download files from exports easily enough, but see the "location tracking" section for trust caveats.

I'm not clear about what you're suggesting be done with versioning support in external special remotes?
thanks -- I will check those all out!
Meanwhile a quick one regarding "I'm not clear about what you're suggesting be done with versioning support in external special remotes?".
I meant that in some cases there might be no need for any custom/special tracking per exported file -- upon export we could just register a unique URL for that particular version of the file for the corresponding KEY, so later on it could be 'annex get'ed even if a new version of the file gets uploaded or the file is removed. So annex could just store the hexsha of the treeish(es) that were exported last, without any explicit additional per-file tracking. The URL might be some custom one to be handled by the special remote backend.
E.g. here is a list of versions (and corresponding urls) for a sample file on the s3 bucket
[[!format sh """
$> datalad ls -aL s3://datalad-test0-versioned/3versions-allversioned.txt
Connecting to bucket: datalad-test0-versioned
[INFO ] S3 session: Connecting to the bucket datalad-test0-versioned
Bucket info:
  Versioning: {'MfaDelete': 'Disabled', 'Versioning': 'Enabled'}
  Website: datalad-test0-versioned.s3-website-us-east-1.amazonaws.com
  ACL: <Policy: yoh@cs.unm.edu (owner) = FULL_CONTROL>
3versions-allversioned.txt ... http://datalad-test0-versioned.s3.amazonaws.com/3versions-allversioned.txt?versionId=Kvuind11HZh._dCPaDAb0OY9dRrQoTMn [OK]
3versions-allversioned.txt ... http://datalad-test0-versioned.s3.amazonaws.com/3versions-allversioned.txt?versionId=b.qCuh7Sg58VIYj8TVHzbRS97EvejzEl [OK]
3versions-allversioned.txt ... http://datalad-test0-versioned.s3.amazonaws.com/3versions-allversioned.txt?versionId=pNsV5jJrnGATkmNrP8.i_xNH6CY4Mo5s [OK]
3versions-allversioned.txt_sameprefix ... http://datalad-test0-versioned.s3.amazonaws.com/3versions-allversioned.txt_sameprefix?versionId=Mvsc4FgJWc6gExwSw1d6wsLrnk6wdDVa [OK]
"""]]
That would almost work without any smarts on the git-annex side. When it tells the special remote to `REMOVEEXPORT`, the special remote could remove the file from the HEAD equivalent but retain the content in its versioned snapshots, and keep the url to that registered. But, that doesn't actually work, because the url is registered for that special remote, not the web special remote. Once git-annex thinks the file has been removed from the special remote, it will never try to use the url registered for that special remote.

So, to support versioning-capable special remotes, there would need to be an additional response to `REMOVEEXPORT` that says "I removed it from HEAD, but I still have a copy at this url, which can be accessed using the web special remote".

DAV = "Distributed Authoring and Versioning", but versioning was forgotten about in the original RFC. Only some servers/clients implement the DeltaV spec (RFC 3253), which came later to fill that gap. But in principle, any DeltaV-compliant WebDAV special remote could then be used for "export" while retaining access to all the versions. References:

- WebDAV and Autoversioning - Version Control with Subversion
- RFC 3253
I got interested when I saw that box.com is supported through WebDAV, but I'm not sure whether DeltaV is supported at all, and apparently the number of versions stored per file depends on the type of account anyway (and no versions for a free personal one): https://community.box.com/t5/How-to-Guides-for-Managing/How-To-Track-Your-Files-and-File-Versions-Version-History/ta-p/329
I also wonder if `SETURLPRESENT Key Url` could be extended to `SETURLPRESENT Key Url Remote`, i.e. so that a custom remote could register a URL with the Web remote? In many cases I expect a "custom uploader/exporter" but then a public URL being available, so demanding the custom external remote to fetch it would be a bit of overkill.

N.B. I was already burnt once on a large scale by our custom remote truthfully replying to CLAIMURL for public URLs (since it can handle them if needed), thus absorbing them into itself instead of relaying responsibility to the 'Web' remote. Had to traverse dozens of datasets and duplicate urls from 'datalad' to the 'Web' remote.
`TRANSFEREXPORT STORE|RETRIEVE Key File Name` -- note that File could also contain spaces etc (not only the Name), so it should be encoded somehow?

"old external special remote programs ... need to handle an ERROR response" -- why not just bump the protocol `VERSION` to e.g. 2, so those which implement this would reply with the new version number?

In some cases, if the remote supports versioning, it might be cool to be able to export all versions (from the previously exported point, assuming linear progression). Having a chat with the https://quiltdata.com/ folks, a project which I just got to know about:

1. They claim/hope to provide infinite storage for public datasets.
2. They support a "File" model, so a dataset could simply contain files. If we could (ab)use that -- sounds like a lovely free ride.
3. They support versioning. If we could export all the versions -- super lovely.
Might also help to establish interoperability between the tools