I'm looking for a tool to efficiently sync large datasets in HDF5 format. Each file is ~20-30 GB and contains multiple large floating-point arrays, usually chunked and compressed with BLOSC. I need to keep these datasets regularly backed up to a central server; at the moment I just run a cron job that calls `rsync -haP` to do this.

I often need to reprocess just one of the arrays within a dataset, so the amount of data that has actually changed (and therefore needs backing up) is often quite small (~100s of MB). However, rsync treats the file as an opaque byte stream with no knowledge of the HDF5 structure inside it, and in practice it will blindly push the whole 20 GB dataset up to the remote server as soon as the file is even slightly modified. This obviously takes a long time and uses huge amounts of bandwidth.

What I'm looking for is a tool that will recursively scan for changes within each individual node in an HDF5 file (perhaps by computing a checksum), and propagate any changes it detects to a remote copy of the file. Ideally, it should be agnostic about the compression, chunk size etc. of the data.
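To make the requirement concrete, the kind of thing I have in mind is sketched below (using h5py; the function name and the 128 MB slab size are just placeholders I made up): walk every dataset in the file and hash its decompressed contents, so the result is independent of chunking and compression.

```python
import hashlib

import h5py
import numpy as np


def dataset_checksums(path, slab_bytes=128 * 1024 * 1024):
    """Return {node_path: sha256_hexdigest} for every dataset in the file.

    Hashes the decompressed array contents, so the result is independent
    of the chunking/compression settings used on disk.
    """
    sums = {}

    def visit(name, obj):
        if not isinstance(obj, h5py.Dataset):
            return
        h = hashlib.sha256()
        if obj.shape == ():  # scalar dataset
            h.update(np.asarray(obj[()]).tobytes())
        else:
            # Read in slabs along the first axis so a 20-30 GB array is
            # never held in memory all at once.
            row_bytes = obj.dtype.itemsize * int(np.prod(obj.shape[1:], dtype=np.int64))
            step = max(1, slab_bytes // max(1, row_bytes))
            for start in range(0, obj.shape[0], step):
                h.update(np.ascontiguousarray(obj[start:start + step]).tobytes())
        sums[name] = h.hexdigest()

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return sums
```

Comparing the mapping this returns against the one saved from the previous backup run would tell me exactly which nodes need to be pushed to the server.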

The only remotely similar thing I've found is `ptrepack`, but that just copies nodes from one file to another; as far as I can tell it has no facility for checking one version of a node against another. I suspect I may end up writing something myself to do the job, but I thought I would ask here first to avoid duplicating someone else's effort!
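If I do end up rolling my own, my current plan (just a sketch; it uses PyTables since that's what `ptrepack` is built on, and `extract_changed_nodes` plus the whole delta-file scheme are my own invention rather than an existing API) would be to copy only the changed nodes into a small "delta" file, rsync that, and merge it into the server's copy on the other end:

```python
import posixpath

import tables


def extract_changed_nodes(src_path, delta_path, changed_paths):
    """Copy only the named nodes (e.g. '/run1/array2') from src_path into a
    small delta file, recreating their parent groups along the way."""
    with tables.open_file(src_path, "r") as src, \
         tables.open_file(delta_path, "w") as delta:
        for node_path in changed_paths:
            parent, _name = posixpath.split(node_path)
            if parent != "/" and parent not in delta:
                # Recreate the group hierarchy leading up to the node.
                delta.create_group(posixpath.dirname(parent) or "/",
                                   posixpath.basename(parent),
                                   createparents=True)
            src.copy_node(node_path,
                          newparent=delta.get_node(parent),
                          recursive=True)
```

On the server side, a second `copy_node(..., overwrite=True)` pass from the delta file into the master copy would apply the changes, with the checksum table from the earlier sketch deciding what goes into `changed_paths`.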
