How we built the IDR
(And how you can build it too)
Simon Li
The Image Data Repository
- A public resource to store, integrate and serve image datasets from published scientific studies
- 42 TB raw data
- 2.8 million images
Overview Part 1: IDR Development
- Get hold of the data
- Buy some disks and servers
- Install OMERO.server (Docker)
- Set up a public OMERO.web
- Import some data
- Discover bugs in OMERO
- Discover bugs in hardware setup
- Import more data
- Discover more bugs
- Import more data
- Discover more bugs
- ....
August 2015: Dundee: our hardware
- Dell PowerVault MD3860f storage array + expansion enclosure: 450+ TB (300+ TB usable)
- 2 FC630 Dell storage servers (12 cores, 128 GB RAM)
- 6 FC630 Dell compute servers (28 cores, 256 GB RAM)
Dundee storage platform: GPFS
- Used by the School of Life Sciences
- Can be shared amongst multiple servers
- Takes a lot of tuning
September 2015: OMERO on Docker
yum install docker
docker run -d -v /idr:/idr openmicroscopy/omero-deploy
Don't use omero-deploy; look out for a new production Docker image instead
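A minimal sketch of the container pattern, assuming the current openmicroscopy/omero-server image and its CONFIG_* environment-variable convention (check Docker Hub for the supported image and settings):

# run PostgreSQL and OMERO.server as linked containers
docker run -d --name db -e POSTGRES_USER=omero -e POSTGRES_PASSWORD=omero -e POSTGRES_DB=omero postgres
docker run -d --name omero-server --link db:db \
    -e CONFIG_omero_db_host=db -e CONFIG_omero_db_user=omero \
    -e CONFIG_omero_db_pass=omero -e CONFIG_omero_db_name=omero \
    -v /idr:/idr openmicroscopy/omero-server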
September 2015: Importing data
omero import -- --checksum-algorithm=File-Size-64 --transfer=ln_s ...
See Session 2: Extensible Import for details
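Filled out with placeholder values, a single in-place import might look like the following; the login and path are hypothetical:

# File-Size-64 is a fast size-based checksum, avoiding a full read of every file;
# --transfer=ln_s symlinks the files in place instead of copying them into OMERO
omero login importer@idr-server.example.org
omero import -- --checksum-algorithm=File-Size-64 --transfer=ln_s /idr/filesets/<study>/<plate>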
System layout
[Diagram: system layout with a front-end proxy in front of the OMERO hosts]
October 2015: Official launch (EMBL Seeing is Believing)
Overview Part 2: IDR Production
- Write Ansible playbooks, deploy on Dundee cloud
- Mirror final deployment environment (OpenStack)
- Reproduce setup on EBI Embassy cloud
- Performance tuning
What is configuration management?
- A systematic way to setup and configure your servers
- "Infrastructure as code"
"Infrastructure as code"
- Reproducible installations
- Documentation
- Version control
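In practice this can be as simple as keeping the whole setup in a git repository, where the history doubles as documentation:

git clone https://github.com/openmicroscopy/infrastructure
cd infrastructure
git log --oneline   # every change to the infrastructure, documented and reversible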
What can it do for OMERO?
- Open-source our infrastructure
- Manage all dependencies
- Install, configure, upgrade
November 2015: OpenStack at Dundee
- Open-source cloud platform used at EMBL-EBI
- Gives us full admin access to a private cloud
- Difficult to install
OMERO on Docker on virtual machines?
- ✔ Easy to deploy OMERO
- ✔ Easy to upgrade
- ✘ Another layer of infrastructure to debug
What we'd really like: someone else to manage the infrastructure
February 2016: OMERO on OpenStack with Ansible
- First production use of Ansible with OMERO
- Single-node deployment
- Uses omego, our OMERO installation and upgrade tool
# omero.yml
- hosts: localhost
  roles:
    - omero-server
  vars:
    postgresql_users_databases:
      - user: omero
        password: omero
        databases: [omero]
git clone https://github.com/openmicroscopy/infrastructure
ansible-playbook omero.yml
March 2016: Copy 40 TB of data from Dundee to Cambridge (400 miles)
- Option 1: Aspera (FTP on steroids)
- Option 2: FedEx (ship physical disks)
April 2016: Multi-node OMERO OpenStack deployment
- Create virtual machines with Ansible
- Install and configure OMERO with Ansible
source openstack-credentials.env
ansible-playbook -i inventory/openstack.py -e @vars/ome2016-overrides.yml os-idr-uod.yml
May 2016: Redeploy at EBI
- include: os-create.yml
  vars:
    omero_vm_extra_groups: "ebi-nfs,idr-hosts"    # was: uod-nfs
    os_cloud_provider: ebi                        # was: uod

- include: os-volumes.yml
  vars:
    os_cloud_provider: ebi                        # was: uod

- hosts: database-hosts
  roles:
    - role: storage-volume-initialise
      storage_volume_initialise_device: /dev/vdb
      storage_volume_initialise_mount: /var/lib/pgsql

- hosts: omero-hosts
  roles:
    - role: storage-volume-initialise
      storage_volume_initialise_device: /dev/vdb1
      storage_volume_initialise_mount: /data

- hosts: proxy-hosts
  roles:
    - role: storage-volume-initialise
      storage_volume_initialise_device: /dev/vdb
      storage_volume_initialise_mount: /var/cache/nginx

- include: idr-ebi-nfs.yml    # was: idr-dundee-nfs.yml

- include: os-omero.yml
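With those overrides, redeployment is the same invocation as at Dundee; the playbook filename below is a hypothetical stand-in:

source openstack-credentials.env
ansible-playbook -i inventory/openstack.py os-idr-ebi.yml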
May 2016: Production tuning
- OMERO.server: Loading screens is slow
- Each screen is several GB
- Bio-Formats may have to load thousands of files
- Clients: Multiple HTTP requests are needed to load a large plate (e.g. 384 thumbnails)
- But at least it's read-only and web-only (for now)
May 2016: Production tuning
- OMERO.server: Bio-Formats cache
- Nginx: Very aggressive front-end caching
- Clients: Optimise HTTP protocol
May 2016: Production tuning
Scripts to pre-warm cache
- All thumbnails (2.8 million)
- All metadata for Screens, Plates and Datasets (3400)
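A rough sketch of the idea behind such a script, assuming OMERO.web's webgateway thumbnail URL and a pre-extracted list of image IDs (host and file are placeholders):

# request each thumbnail once so the front-end cache is populated
while read id; do
    curl -s -o /dev/null "https://idr.example.org/webgateway/render_thumbnail/${id}/"
done < image-ids.txt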
May 2016: Production tuning
Cached on first request
- Images (36 million planes)
- Image metadata (2.8 million)
May 2016: Production tuning
- Web session management: Redis
- Client speed-up: HTTP/2
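A minimal sketch of the Redis wiring, assuming the django-redis backend for OMERO.web's cache-based sessions (check the OMERO.web documentation for the settings supported by your version):

omero config set omero.web.session_engine django.contrib.sessions.backends.cache
omero config set omero.web.caches '{"default": {"BACKEND": "django_redis.cache.RedisCache", "LOCATION": "redis://127.0.0.1:6379/0"}}'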
Thank you
- Prof. Jason Swedlow
- OME team