Files and folders might become corrupted when transferred between computers
or even between disks, in a way that is not obvious to your code, so it won't
throw a warning but might still affect the validity of your data.
- Solution
This problem can occur when transferring files from one cluster to another,
but it can also occur at any point where a file transfer takes place, e.g.
* Transferring a file from an internal to an external disk drive
* Transferring a file from one internal disk drive to another
* Transferring a file from one computer to another over the network
It can even take place when no operations are being performed on the file at
all. This is called bit rot, and data centers that specialise in archival
storage, where data integrity is of high importance, employ specialised
hardware and software to detect and correct it.
For our purposes, we can focus on best practices when downloading or
uploading files. This boils down to two things:
(1) Instead of transferring multiple small files and folders, it is better
to transfer a single item.
This can be achieved with a command like
`tar -czf directory.tgz directory` if we are interested in transferring
a single directory, but it can, of course, accommodate as many files and
folders as we require, as in the sketch below.
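As a minimal sketch, assuming two hypothetical directories `data` and
`results` and a file `notes.txt` that we want to bundle together:

    # bundle several items into one compressed archive
    # (data, results and notes.txt are illustrative names)
    tar -czf bundle.tgz data results notes.txt

    # at the destination, unpack the archive with:
    tar -xzf bundle.tgz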
For the next step we need a way of generating a unique "identity" for the
tgz archive. For this we can use a checksum. One way to compute it is with
the command below:
`md5sum directory.tgz`
This will print a string of hexadecimal characters (the aforementioned
"identity" of the file) followed by white-space and the filename. The
output of the command can be stored in a file for easier comparison. After
transferring the file to the destination we can run the `md5sum` command
there as well and verify the hashes are identical.
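A minimal sketch of the full round trip, assuming `scp` and a hypothetical
destination `user@dest:/path/`; `md5sum -c` reads a checksum file and
verifies the files listed in it:

    # on the source machine: record the checksum, then transfer both files
    md5sum directory.tgz > directory.tgz.md5
    scp directory.tgz directory.tgz.md5 user@dest:/path/

    # on the destination machine: verify the archive against the recorded hash
    md5sum -c directory.tgz.md5   # prints "directory.tgz: OK" on success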
An added benefit of transferring data in a single archive is that it is
usually faster, as our file transfer program of choice (e.g. `scp` or
`rsync`) pays its per-file overhead once instead of once for every small
file.
(2) Alternatively, assuming it doesn't make sense to bundle our data in a
single archive, we can run `md5sum` on all files to be transferred and
compare all of the checksums before and after the transfer. This can be
achieved in many ways but one command that would do the trick is:
`find -L . -type f -print0 | xargs -0 md5sum | sort -Vk2`
This can be run from a location that contains all of the files you want to
transfer. A short explanation of the various flags follows:
`-L`: This instructs `find` to follow symlinks
`.`: This instructs `find` to search in the current directory
`-type f`: This instructs `find` to only match regular files
`-print0`: This instructs `find` to separate the results with NUL bytes,
           so file names containing spaces are handled safely
`xargs -0 md5sum`: Run `md5sum` on all detected files, reading the
           NUL-separated list
`sort -Vk2`: Sort the results by the second field (the file name) using
           version sort, which does not depend on locale settings, so the
           two listings can be compared line by line
The two files produced by the above command (before and after the transfer)
can be compared to ensure the transferred files are identical.
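A minimal sketch of that workflow, writing the listings to illustrative
file names under `/tmp` so the checksum files themselves do not show up in
the output:

    # on the source machine, from the directory to be transferred
    find -L . -type f -print0 | xargs -0 md5sum | sort -Vk2 > /tmp/checksums.src

    # after the transfer, on the destination machine
    find -L . -type f -print0 | xargs -0 md5sum | sort -Vk2 > /tmp/checksums.dst

    # bring both listings onto one machine (e.g. with `scp`), then:
    diff /tmp/checksums.src /tmp/checksums.dst   # no output means all files match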
Panos