Internal data transfers

This page describes how to transfer data that is already on Esrum to another location on Esrum. All transfers must be performed on a compute node, as described below. Transfers running on the head node or on the RStudio nodes will be terminated on sight, as they impact all users of those nodes.

Transferring data between projects or datasets

As a rule of thumb, data should only be located in one project or dataset folder. However, should you need to make a copy of one or more files, then it is recommended to use the rsync command to do so. See the Rsync basics section below for more information.

You must run your copy commands (whether you use rsync, cp, or some other tool) on a compute node, either in an interactive session or by using srun to execute the command on a compute node, as shown in the examples below. See the Running commands using srun section for more information about using srun.

  1. If you are copying data from a /projects folder, use the command

    srun rsync -av --progress /copy/this/data/ /to/this/location/
    
  2. If you are copying data from a /datasets folder, use the command

    srun rsync -av --no-perms --progress /copy/this/data/ /to/this/location/
    

Warning

Do not copy data out of -AUDIT folders without explicit permission from the data controller, and never store sensitive data in a non-AUDIT folder!

Warning

Transfers running on the head node will be terminated without warning, due to the impact on other users of the cluster.

Tip

Running your transfer in a tmux or screen session is recommended. This allows your transfer to keep running after you log off from Esrum. See the Persistent sessions with tmux page for more information.
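
For example, a transfer wrapped in a tmux session could look like the following sketch; the session name transfer and the paths are placeholders:

# Start a named tmux session
tmux new -s transfer
# Run the transfer on a compute node via srun
srun rsync -av --progress /copy/this/data/ /to/this/location/
# Detach with Ctrl+b followed by d; re-attach later with 'tmux attach -t transfer'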

Copying data to/from the H:, N:, and S: drives

To avoid impacting other users, you must run transfers on compute nodes. However, as described on the Network drives (H:, N:, S:) page, the H:, N:, and S: drives are not accessible from compute nodes by default.

Therefore, you must start an interactive session, log in using the /usr/bin/kinit command, and then access the network drives via the /maps folder:

# Start an interactive session
srun --pty -- /bin/bash
# Log in to enable the network drives
/usr/bin/kinit
# View my H: drive; '${USER}' corresponds to your abc123 username
ls /maps/hdir/${USER}/

Your login will expire after about 12 hours, at which point you will have to run /usr/bin/kinit on the node again. However, while your login is active, your network folders can be found at the following locations:

Drive   Location
H:      /maps/hdir/${USER}
S:      /maps/sdir/${USER}
N:      /maps/groupdir/${USER}

Note that these folders will only be created once you attempt to access them, provided that you have logged in using /usr/bin/kinit.

It is recommended to use rsync to copy data to/from the network drives, as described below, but you do not need to use srun in this case, since you are already working in an interactive session if you followed the instructions above.
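
For example, to copy a folder from your H: drive into a project folder (the folder names my-folder and my-project below are placeholders), you could run:

# Inside the interactive session, after running /usr/bin/kinit
rsync -av --progress /maps/hdir/${USER}/my-folder/ /projects/my-project/my-folder/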

Warning

Do not copy data out of -AUDIT folders without explicit permission from the data controller, and never store sensitive data in a non-AUDIT folder!

Warning

Transfers running on the head node will be terminated without warning, due to the impact on other users of the cluster.

Tip

Running your transfer in a tmux or screen session is recommended. This allows your transfer to keep running after you log off from Esrum. See the Persistent sessions with tmux page for more information.

Rsync basics

rsync allows you to recursively copy data between two locations, either on the same system or between two different systems (via SSH). Unlike with plain cp, it is also easy to resume an interrupted transfer, simply by running rsync again.

The basic rsync command you should be using is

rsync -av --progress /copy/this/data/ /to/this/location/
  • The -a option enables "archive" mode, which preserves meta-information such as timestamps and permissions.

  • The -v and --progress options are optional, but they make rsync list the files being copied and show the progress when copying (large) files.

  • The paths in the above example both end in a /. This is intentional and makes rsync copy the contents of the data folder into the location folder. If you instead ran rsync -av --progress /copy/this/data /to/this/location/, then the data folder itself would be placed at /to/this/location/data, as illustrated below.
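
To illustrate the difference (the paths are placeholders):

# With a trailing slash: the contents of data/ are copied into /to/this/location/
rsync -av --progress /copy/this/data/ /to/this/location/
# Without a trailing slash: the folder itself is copied, ending up at /to/this/location/data
rsync -av --progress /copy/this/data /to/this/location/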

However, when copying data from a /datasets folder it is necessary to add the --no-perms option, since rsync would otherwise set all permissions to 000, due to how access control is implemented for /datasets. See the troubleshooting section below if you forget to add this option.

You must run the rsync command on a compute node, either in an interactive session or by using srun to automatically run the command on a compute node. See the Running commands using srun section for more information about using srun.

Troubleshooting

rsync fails with Permission denied when copying from /datasets

If you forget to use the --no-perms option when rsync'ing data out of a /datasets folder, then all permissions will be set to 000. In other words, nobody can read, write, or execute those files and folders.

To fix this, first run the following command to fix the permissions, where /path/to/copied/data is the path to the copy of the data that you have created:

chmod -R +rX,u+w /path/to/copied/data

This will recursively mark files and folders readable for everyone, mark folders executable for everyone (required to browse them), and mark files and folders writable for you (and only you).

Then re-run rsync and remember to include the --no-perms option.
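
Putting the two steps together, a recovery might look like the following sketch; the source and destination paths are placeholders:

# 1. Fix the permissions of the files already copied
chmod -R +rX,u+w /path/to/copied/data
# 2. Re-run the transfer with --no-perms to copy any remaining files
srun rsync -av --no-perms --progress /datasets/my-dataset/ /path/to/copied/data/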

Permission denied when accessing data copied from /datasets

See above.

The ~/ucph folder or subfolders are missing

Note that the ~/ucph folder is only available on the head node (esrumhead01fl), and not on the RStudio servers nor on the compute nodes. See the Accessing network drives from compute nodes section for how to access the drives elsewhere.

If you are connected to the head node, then first make sure that you are not using GSSAPI (Kerberos) to log in. See the Connecting to the cluster page for instructions on how to disable this feature if you are using MobaXterm.

Once you have logged in to Esrum without GSSAPI enabled, and if the folder(s) are still missing, then run the following command to create any missing network folders:

$ bash /etc/profile.d/symlink-ucphmaps.sh

Once this is done, you should have a ucph symlink in your home folder containing links to hdir (H:), ndir (N:), and sdir (S:).
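
You can verify this with a simple listing, which (assuming the script ran without errors) should show the three links:

$ ls ~/ucph/
hdir  ndir  sdir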

No such file or directory when accessing network drives

If you get a No such file or directory error when attempting to access the network drives (~/ucph/hdir, ~/ucph/ndir, or ~/ucph/sdir), then please make sure that you are not logging in using Kerberos (GSSAPI). See the Accessing network drives via MobaXterm section for instructions on how to disable this feature if you are using MobaXterm.

Note also that your login is only valid for about 10 hours, after which you will lose access to the network drives. See the (Re)activating access to the network drives section for how to re-authenticate if your access has timed out.

kinit: Unknown credential cache type while getting default ccache

The kinit command may fail if you are using a conda environment:

(base) $ kinit
kinit: Unknown credential cache type while getting default ccache

To circumvent this problem, either specify the full path to the kinit executable (i.e. /usr/bin/kinit) or deactivate the current/base environment by running conda deactivate.
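
In other words, either of the following should work:

# Option 1: call kinit by its full path
(base) $ /usr/bin/kinit
# Option 2: deactivate the conda environment first
(base) $ conda deactivate
$ kinit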