Hub design

Design aims

Flexible

The hub environment needs to be flexible: scientists should be able to run custom analyses. This requires the ability to run scripts through command-line access.

Easy to use

Security measures are sometimes at odds with ease of use. The hub aims to avoid this tension as much as possible, which includes making file syncing easy.

Secure

The analysis is secure. The hub is ISO 27001 compliant, uses 2-factor authentication, and does not allow direct data transfer to other machines on the internet. Fine-grained permissions control access to the different datasets.

Data harmonization

Our goal is to provide harmonized pre-processed data. With this, quality control and harmonization efforts can be shared across the various projects.

Architecture

Analysis hub

The hub consists of a user interface machine and a variable number of worker nodes. A firewall prevents access to and from the internet. The hub has access to online storage (CephFS) and a tape storage system (dCache). SSH access is only allowed through a so-called door node.

Door node

Access to the hub will be provided through one of the bastion services. The door node prevents file transfer but allows SSH access (with 2-factor authentication) through a two-step log-in process, acting as a stepping stone to the user interface machine of the Alzheimer Genetics Hub (AGH).
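The two-step log-in can be pictured as a single OpenSSH command that hops through the door node. The following is a minimal sketch, not the hub's documented procedure: the host names and the helper `build_ssh_command` are hypothetical, and it assumes the door node permits OpenSSH's jump-host option (`-J`). If it does not, the same two steps would be performed as two manual log-ins.

```python
import shlex

# Hypothetical host names: the real door node and user interface
# addresses are not given in this document.
DOOR_NODE = "doornode.example.org"
UI_MACHINE = "ui.agh.example.org"


def build_ssh_command(
    user: str,
    door_node: str = DOOR_NODE,
    ui_machine: str = UI_MACHINE,
) -> list[str]:
    """Return an OpenSSH command that reaches the UI machine via the door node.

    `-J` is OpenSSH's ProxyJump option: the client first connects to the
    jump host and then opens the second connection through it.
    """
    return ["ssh", "-J", f"{user}@{door_node}", f"{user}@{ui_machine}"]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(build_ssh_command("alice")))
```

Each hop would still prompt for its own authentication; the sketch only shows how the stepping-stone route is expressed on the client side.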

Authentication and authorization

The system uses SRAM, a self-service authentication method that links to your personal institute account. Authentication to the AGH will require 2-factor authentication: a username and password plus an authentication token.

Data transfer node

Summary tables and plots need to be downloadable, and it must be possible to upload software and (annotation) datasets. This is handled by a data transfer node, which logs all upload and download actions and keeps a copy of all downloaded data. It also blocks the large-scale data transfers that would be needed to download raw data or the full processed dataset.
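The behaviour described for the data transfer node (log every action, retain what is downloaded, refuse large downloads) can be modelled in a few lines. This is an illustrative sketch only: the class `TransferNode`, its field names, and the 100 MiB limit are assumptions made for the example, not the hub's actual implementation or threshold.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical size limit; the real threshold is not stated in this document.
MAX_DOWNLOAD_BYTES = 100 * 1024 * 1024


@dataclass
class TransferNode:
    """Toy model of a transfer node that audits and limits transfers."""

    max_download_bytes: int = MAX_DOWNLOAD_BYTES
    audit_log: list = field(default_factory=list)
    retained_downloads: dict = field(default_factory=dict)

    def _log(self, user: str, action: str, name: str, size: int, allowed: bool) -> None:
        # Every request is recorded, whether or not it is allowed.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action,
            "file": name,
            "bytes": size,
            "allowed": allowed,
        })

    def upload(self, user: str, name: str, payload: bytes) -> bool:
        # Uploads (software, annotation datasets) are logged and accepted.
        self._log(user, "upload", name, len(payload), True)
        return True

    def download(self, user: str, name: str, payload: bytes) -> bool:
        # Downloads above the limit are refused; allowed ones are retained.
        allowed = len(payload) <= self.max_download_bytes
        self._log(user, "download", name, len(payload), allowed)
        if allowed:
            self.retained_downloads[(user, name)] = payload
        return allowed
```

With a small limit, a summary plot passes while a raw-data-sized file is rejected, and both attempts appear in `audit_log`.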

Pre-processing environment

Data is pre-processed by the Hub team on Snellius, the Dutch National Supercomputer. Access to Snellius is safeguarded through 2-factor authentication.

Online storage

This is the standard online storage system used by Spider (CephFS). The file system is mounted on the user interface machine and the worker nodes. It is not accessible to users outside of the AGH clone environment.

dCache tape storage

This is the tape storage system. Data providers can upload their data directly to this system using an upload-only token. After pre-processing, data at rest on tape is stored in an encrypted format.

Permission system

The storage system uses a fine-grained ACL permission system. Users get access only to the datasets and projects for which the data providers have granted them permission.
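The idea of per-dataset permissions granted by data providers can be illustrated with a small in-memory model. This is a sketch under stated assumptions: the class `DatasetACL`, its methods, and the dataset names are hypothetical, and the model does not represent the ACL mechanism the storage system actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetACL:
    """Toy access-control list: dataset name -> users granted access."""

    grants: dict[str, set[str]] = field(default_factory=dict)

    def grant(self, dataset: str, user: str) -> None:
        # Called when a data provider approves a user for a dataset.
        self.grants.setdefault(dataset, set()).add(user)

    def revoke(self, dataset: str, user: str) -> None:
        # Removing a user who was never granted access is a no-op.
        self.grants.get(dataset, set()).discard(user)

    def can_access(self, dataset: str, user: str) -> bool:
        # Default deny: unknown datasets and unlisted users get no access.
        return user in self.grants.get(dataset, set())


if __name__ == "__main__":
    acl = DatasetACL()
    acl.grant("cohort_a", "alice")
    print(acl.can_access("cohort_a", "alice"))
    print(acl.can_access("cohort_b", "alice"))
```

The default-deny check mirrors the text: a user sees a dataset only after an explicit grant for that dataset.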

Technical design

The Alzheimer Genetics Hub is hosted at SURF. Data pre-processing takes place on the Dutch National Supercomputer Snellius. The hub itself sits in a private network and has only three user connections to the internet: SSH (through the door node), a managed data transfer node, and a secured tape storage system managed by the Hub personnel.