Common datasets

The Data Analytics platform maintains a set of commonly used resources in /datasets/cbmr_shared.

This folder contains a range of public datasets, including:

  • genomes/ for commonly used reference genomes, such as various human reference genomes.

  • databases/ for databases such as dbSNP, ClinVar, and more.

  • resources/ for heterogeneous resources and resources used by specific tools, such as the GATK resource bundle, VEP annotation caches, BLAST, and index files generated by BWA or Bowtie2.

README files are included in each (sub)folder with additional information about the sources, versions, and command-line options used to generate the dataset, as appropriate.
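As a minimal sketch of how one might locate these README files, the following Python snippet walks the shared folder to a limited depth and lists any file whose name starts with "README". The helper name `find_readmes`, the `max_depth` parameter, and the assumption that README files are named `README`, `README.md`, `README.txt`, or similar are illustrative and not taken from the platform documentation.

```python
import os
from pathlib import Path


def find_readmes(root, max_depth=2):
    """List README files in `root` and in folders up to `max_depth` levels below it.

    Assumption: README files have names starting with "README" (any case).
    """
    root = Path(root)
    readmes = []
    for dirpath, dirnames, filenames in os.walk(root):
        depth = len(Path(dirpath).relative_to(root).parts)
        if depth >= max_depth:
            # Do not descend further; dataset folders can be very large.
            dirnames[:] = []
        dirnames.sort()  # visit subfolders in a stable, alphabetical order
        for name in sorted(filenames):
            if name.upper().startswith("README"):
                readmes.append(Path(dirpath) / name)
    return readmes


if __name__ == "__main__":
    # Path taken from the text above; prints nothing if the folder is absent.
    for readme in find_readmes("/datasets/cbmr_shared"):
        print(readme)
```

Limiting the depth keeps the scan quick on large folders; increase `max_depth` if the README you are looking for sits deeper in the hierarchy.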

Requesting additional datasets

New datasets and new releases of existing datasets are added on request. To request one, simply contact us with the location (URL) of the dataset and the specific release you are interested in. For datasets that come in multiple versions, e.g. for different genome builds, please also specify which version you need.

Note that the data must have been released under a license that allows it to be accessed by all employees of CBMR, their collaborators, students, and anyone else who might get access to Esrum. Data under more restrictive licenses should instead be added to a private project or dataset folder, so that only those with a valid license can access the data.