I'm looking for a tool to efficiently sync large datasets in HDF5 format. Each file is ~20-30 GB and contains multiple large floating-point arrays, usually chunked and compressed with BLOSC. I need to keep these datasets regularly backed up to a central server; at the moment I just run a cron job that calls `rsync -haP` to do this.

I often need to reprocess just one of the arrays within a dataset, so the amount of data that has actually changed (and therefore needs backing up) is often quite small (~100s of MB). However, rsync treats the file as an opaque byte stream with no knowledge of the HDF5 structure inside it, and in practice it will blindly push the whole 20 GB dataset up to the remote server as soon as the file is even slightly modified. This obviously takes a long time and uses huge amounts of bandwidth.

What I'm looking for is a tool that will recursively scan for changes within each individual node in an HDF5 file (perhaps by computing a checksum), and propagate any changes it detects to a remote copy of the file. Ideally, it should be agnostic about the compression, chunk size etc. of the data.
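To make the requirement concrete, the kind of thing I have in mind is sketched below (using h5py; the function name and the 128 MB slab size are just placeholders I made up): walk every dataset in the file and hash its decompressed contents, so the result is independent of chunking and compression.

```python
import hashlib

import h5py
import numpy as np


def dataset_checksums(path, slab_bytes=128 * 1024 * 1024):
    """Return {node_path: sha256_hexdigest} for every dataset in the file.

    Hashes the decompressed array contents, so the result is independent
    of the chunking/compression settings used on disk.
    """
    sums = {}

    def visit(name, obj):
        if not isinstance(obj, h5py.Dataset):
            return
        h = hashlib.sha256()
        if obj.shape == ():  # scalar dataset
            h.update(np.asarray(obj[()]).tobytes())
        else:
            # Read in slabs along the first axis so a 20-30 GB array is
            # never held in memory all at once.
            row_bytes = obj.dtype.itemsize * int(np.prod(obj.shape[1:], dtype=np.int64))
            step = max(1, slab_bytes // max(1, row_bytes))
            for start in range(0, obj.shape[0], step):
                h.update(np.ascontiguousarray(obj[start:start + step]).tobytes())
        sums[name] = h.hexdigest()

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return sums
```

Comparing the mapping this returns against the one saved from the previous backup run would tell me exactly which nodes need to be pushed to the server.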

The only remotely similar thing I've found is `ptrepack`, but that just copies nodes from one file to another; as far as I can tell it has no facility for checking one version of a node against another. I suspect I may end up writing something myself to do the job, but I thought I would ask here first to avoid duplicating someone else's effort!
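If I do end up rolling my own, my current plan (just a sketch; it uses PyTables since that's what `ptrepack` is built on, and `extract_changed_nodes` plus the whole delta-file scheme are my own invention rather than an existing API) would be to copy only the changed nodes into a small "delta" file, rsync that, and merge it into the server's copy on the other end:

```python
import posixpath

import tables


def extract_changed_nodes(src_path, delta_path, changed_paths):
    """Copy only the named nodes (e.g. '/run1/array2') from src_path into a
    small delta file, recreating their parent groups along the way."""
    with tables.open_file(src_path, "r") as src, \
         tables.open_file(delta_path, "w") as delta:
        for node_path in changed_paths:
            parent, _name = posixpath.split(node_path)
            if parent != "/" and parent not in delta:
                # Recreate the group hierarchy leading up to the node.
                delta.create_group(posixpath.dirname(parent) or "/",
                                   posixpath.basename(parent),
                                   createparents=True)
            src.copy_node(node_path,
                          newparent=delta.get_node(parent),
                          recursive=True)
```

On the server side, a second `copy_node(..., overwrite=True)` pass from the delta file into the master copy would apply the changes, with the checksum table from the earlier sketch deciding what goes into `changed_paths`.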
