Internal data transfers

This page describes how to transfer data that is already on Esrum to another location on Esrum. All transfers must be performed on a compute node, as described below. Transfers running on the head node or on the RStudio nodes will be terminated on sight, as they impact all users of those nodes.

Transferring data between projects or datasets

As a rule of thumb, data should only be located in one project or dataset folder. However, should you need to make a copy of one or more files, then it is recommended to use the rsync command to do so. See the Rsync basics section below for more information.

You must run your copy commands (whether you use rsync, cp, or some other tool) on a compute node, either in an interactive session or by using srun to execute the command on a compute node, as shown in the examples below. See the Running commands using srun section for more information about using srun.

  1. If you are copying data from a /projects folder, use the command

    srun rsync -av --progress /copy/this/data/ /to/this/location/
    
  2. If you are copying data from a /datasets folder, use the command

    srun rsync -av --no-perms --progress /copy/this/data/ /to/this/location/
    

Warning

Do not copy data out of -AUDIT folders without explicit permission from the data controller, and never store sensitive data in a non-AUDIT folder!

Warning

Transfers running on the head node will be terminated without warning, due to the impact on other users of the cluster.

Tip

Running your transfer in a tmux or screen session is recommended. This allows your transfer to keep running after you log off from Esrum. See the Persistent sessions with tmux page for more information.
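
For example, a transfer wrapped in a tmux session could look like the following sketch; the session name transfer and the paths are placeholders:

# Start a named tmux session
tmux new -s transfer
# Run the transfer on a compute node via srun
srun rsync -av --progress /copy/this/data/ /to/this/location/
# Detach with Ctrl+b followed by d; re-attach later with 'tmux attach -t transfer'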

Copying data to/from the H:, N:, and S: drives

To avoid impacting other users, you must run transfers on compute nodes. However, as described on the Network drives (H:, N:, S:) page, the H:, N:, and S: drives are not accessible from compute nodes by default.

Therefore, you must start an interactive session, log in using the /usr/bin/kinit command, and then access the network drives via the /maps folder:

# Start an interactive session
srun --pty -- /bin/bash
# Log in to enable the network drives
/usr/bin/kinit
# View my H: drive; '${USER}' corresponds to your abc123 username
ls /maps/hdir/${USER}/

Your login will expire after about 12 hours, at which point you will have to run /usr/bin/kinit on the node again. However, while your login is active, your network folders can be found at the following locations:

Drive   Location
H:      /maps/hdir/${USER}
S:      /maps/sdir/${USER}
N:      /maps/groupdir/${USER}

Note that these folders will only be created once you attempt to access them, provided that you have logged in using /usr/bin/kinit.

It is recommended to use rsync to copy data to/from the network drives, as described below, but you do not need to use srun in this case, since you are already working in an interactive session if you followed the instructions above.
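
For example, to copy a folder from your H: drive into a project folder (the folder names my-folder and my-project below are placeholders), you could run:

# Inside the interactive session, after running /usr/bin/kinit
rsync -av --progress /maps/hdir/${USER}/my-folder/ /projects/my-project/my-folder/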

Warning

Do not copy data out of -AUDIT folders without explicit permission from the data controller, and never store sensitive data in a non-AUDIT folder!

Warning

Transfers running on the head node will be terminated without warning, due to the impact on other users of the cluster.

Tip

Running your transfer in a tmux or screen session is recommended. This allows your transfer to keep running after you log off from Esrum. See the Persistent sessions with tmux page for more information.

Rsync basics

rsync allows you to recursively copy data between two locations, either on the same system or between two different systems (via SSH). Unlike with plain cp, it is also easy to resume an interrupted transfer, simply by running rsync again.

The basic rsync command you should be using is

rsync -av --progress /copy/this/data/ /to/this/location/
  • The -a option enables "archive" mode, which preserves meta-information such as timestamps and permissions.

  • The -v and --progress options are optional, but they make rsync list the files being copied and show the progress when copying (large) files.

  • The paths in the above example both end in a /. This is intentional and makes rsync copy the contents of the data folder into the location folder. If you instead ran rsync -av --progress /copy/this/data /to/this/location/, then the data folder itself would be placed at /to/this/location/data, as illustrated below.
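
To illustrate the difference (the paths are placeholders):

# With a trailing slash: the contents of data/ are copied into /to/this/location/
rsync -av --progress /copy/this/data/ /to/this/location/
# Without a trailing slash: the folder itself is copied, ending up at /to/this/location/data
rsync -av --progress /copy/this/data /to/this/location/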

However, when copying data from a /datasets folder it is necessary to add the --no-perms option, since rsync would otherwise set all permissions to 000, due to how access control is implemented for /datasets. See the troubleshooting section below if you forget to add this option.

You must run the rsync command on a compute node, either in an interactive session or by using srun to automatically run the command on a compute node. See the Running commands using srun section for more information about using srun.

Troubleshooting

rsync fails with Permission denied when copying from /datasets

If you forget to use the --no-perms option when rsync'ing data out of a /datasets folder, then all permissions will be set to 000. In other words, nobody can read, write, or execute those files and folders.

To fix this, first run the following command to fix the permissions, where /path/to/copied/data is the path to the copy of the data that you have created:

chmod -R +rX,u+w /path/to/copied/data

This will recursively mark files and folders readable for everyone, mark folders executable for everyone (required to browse them), and mark files and folders writable for you (and only you).

Then re-run rsync and remember to include the --no-perms option.
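
Putting the two steps together, a recovery might look like the following sketch; the source and destination paths are placeholders:

# 1. Fix the permissions of the files already copied
chmod -R +rX,u+w /path/to/copied/data
# 2. Re-run the transfer with --no-perms to copy any remaining files
srun rsync -av --no-perms --progress /datasets/my-dataset/ /path/to/copied/data/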

Permission denied when accessing data copied from /datasets

See above.

The ~/ucph folder or subfolders are missing

Note that the ~/ucph folder is only available on the head node (esrumhead01fl), and not on the RStudio servers nor on the compute nodes. See the Accessing network drives from compute nodes section for how to access the drives elsewhere.

If you are connected to the head node, then first make sure that you are not using GSSAPI (Kerberos) to log in. See the Connecting to the cluster page for instructions on how to disable this feature if you are using MobaXterm.

Once you have logged in to Esrum without GSSAPI enabled, and if the folder(s) are still missing, then run the following command to create any missing network folders:

$ bash /etc/profile.d/symlink-ucphmaps.sh

Once this is done, you should have a ucph symlink in your home folder containing links to hdir (H:), ndir (N:), and sdir (S:).
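
You can verify this with a simple listing, which (assuming the script ran without errors) should show the three links:

$ ls ~/ucph/
hdir  ndir  sdir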

No such file or directory when accessing network drives

If you get a No such file or directory error when attempting to access the network drives (~/ucph/hdir, ~/ucph/ndir, or ~/ucph/sdir), then please make sure that you are not logging in using Kerberos (GSSAPI). See the Accessing network drives via MobaXterm section for instructions on how to disable this feature if you are using MobaXterm.

Note also that your login is only valid for about 10 hours, after which you will lose access to the network drives. See the (Re)activating access to the network drives section for how to re-authenticate if your access has timed out.

kinit: Unknown credential cache type while getting default ccache

The kinit command may fail if you are using a conda environment:

(base) $ kinit
kinit: Unknown credential cache type while getting default ccache

To circumvent this problem, either specify the full path to the kinit executable (i.e. /usr/bin/kinit) or deactivate the current/base environment by running conda deactivate.
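
In other words, either of the following should work:

# Option 1: call kinit by its full path
(base) $ /usr/bin/kinit
# Option 2: deactivate the conda environment first
(base) $ conda deactivate
$ kinit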