How we built the IDR

(And how you can build it too)

Simon Li

The Image Data Repository

  • A public resource to store, integrate and serve image datasets from published scientific studies
  • 42 TB raw data
  • 2.8 million images

Overview Part 1: IDR Development

  1. Get hold of the data
  2. Buy some disks and servers
  3. Install OMERO.server (Docker)
  4. Set up a public OMERO.web
  5. Import some data
  6. Discover bugs in OMERO
  7. Discover bugs in hardware setup
  8. Import more data
  9. Discover more bugs
  10. Import more data
  11. Discover more bugs
  12. ....

August 2015: Dundee: our hardware

  • Dell PowerVault MD3860f storage array + expansion enclosure: 450+ TB (300+ TB usable)
    • 2 FC630 Dell storage servers (12 cores, 128 GB RAM)
  • 6 FC630 Dell compute servers (28 cores, 256 GB RAM)

Dundee storage platform: GPFS

  • Used by the School of Life Sciences
  • Can be shared amongst multiple servers
  • Takes a lot of tuning

Dundee compute servers

Production
Development

September 2015: OMERO on Docker


yum install docker

docker run -d -v /idr:/idr openmicroscopy/omero-deploy
          

Don't use omero-deploy; look out for a new production Docker image

September 2015: Importing data


omero import -- --checksum-algorithm=File-Size-64 --transfer=ln_s ...
          

See Session 2: Extensible Import for details
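
A rough illustration of why these two options matter at this scale: `--transfer=ln_s` leaves the files where they are and links to them, and File-Size-64 is understood here to be a check based on file size rather than a hash of the file contents. The Python sketch below mimics both ideas using only the standard library. The function names and directory layout are hypothetical, and this is not OMERO's implementation.

```python
import os
import tempfile


def link_in_place(source_path, managed_dir):
    """Symlink source_path into managed_dir instead of copying it (hypothetical helper)."""
    link_path = os.path.join(managed_dir, os.path.basename(source_path))
    os.symlink(os.path.abspath(source_path), link_path)
    return link_path


def size_check(path):
    """Return the file size in bytes; cheap, but blind to changes that keep the size."""
    return os.stat(path).st_size  # os.stat follows symlinks


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = os.path.join(tmp, "plate.tiff")
        managed = os.path.join(tmp, "managed")
        os.mkdir(managed)
        with open(raw, "wb") as handle:
            handle.write(b"\0" * 4096)
        link = link_in_place(raw, managed)
        # The link uses no extra storage and reports the same size as the original.
        print(os.path.islink(link), size_check(link) == size_check(raw))
```

The trade-off is speed against safety: nothing is copied or fully read, so tens of terabytes can be registered quickly, but the original files must stay in place and a size-only check will not catch corrupted contents.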

System layout

front-end proxy

October 2015: Official launch (EMBL Seeing is Believing)

Overview Part 2: IDR Production

  1. Write Ansible playbooks, deploy on Dundee cloud
  2. Mirror final deployment environment (OpenStack)
  3. Reproduce setup on EBI Embassy cloud
  4. Performance tuning

What is configuration management?

  • A systematic way to set up and configure your servers
  • "Infrastructure as code"

"Infrastructure as code"

  • Reproducible installations
  • Documentation
  • Version control

What can it do for OMERO?

  • Open-source our infrastructure
  • Manage all dependencies
  • Install, configure, upgrade

November 2015: OpenStack at Dundee

  • Open-source cloud platform used at EMBL-EBI
  • Gives us full admin access to a private cloud
  • Difficult to install

OMERO on Docker on virtual machines?

  • Easy to deploy OMERO
  • Easy to upgrade
  • Another layer of infrastructure to debug

What we'd really like: someone else to manage the infrastructure

February 2016: OMERO on OpenStack with Ansible

  • First production use of Ansible with OMERO
  • Single-node deployment
  • Uses omego, our OMERO installation and upgrade tool

# omero.yml
- hosts: localhost
  roles:
  - role: omero-server
    postgresql_users_databases:
    - user: omero
      password: omero
      databases: [omero]
            

git clone https://github.com/openmicroscopy/infrastructure

ansible-playbook omero.yml
            

March 2016: Copy 40 TB of data from Dundee to Cambridge (400 miles)

  • Option 1: Aspera (FTP on steroids)
  • Option 2: Fedex
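
A back-of-the-envelope calculation helps when weighing a network transfer against shipping disks. The sketch below estimates how long 40 TB takes at a sustained line rate. The rates are hypothetical examples rather than measurements from the Dundee or EBI links, and TB is taken as decimal (10^12 bytes).

```python
def transfer_days(terabytes, megabits_per_second):
    """Days needed to move `terabytes` (decimal TB) at a sustained line rate."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 86400


if __name__ == "__main__":
    # Hypothetical sustained rates; real throughput depends on the link and the tool.
    for rate in (100, 1000, 10000):
        print(f"{rate:>6} Mbit/s: {transfer_days(40, rate):6.1f} days")
```

At a sustained 1 Gbit/s, 40 TB works out to roughly 3.7 days, which is why a courier is a serious option at this size.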

April 2016: Multi-node OMERO OpenStack deployment

  • Create virtual machines with Ansible
  • Install and configure OMERO with Ansible

source openstack-credentials.env

ansible-playbook -i inventory/openstack.py -e @vars/ome2016-overrides.yml os-idr-uod.yml
            

May 2016: Redeploy at EBI


- include: os-create.yml
  vars:
    omero_vm_extra_groups: "ebi-nfs,idr-hosts"  # was: uod-nfs
    os_cloud_provider: ebi  # was: uod

- include: os-volumes.yml
  vars:
    os_cloud_provider: ebi  # was: uod

- hosts: database-hosts
  roles:
  - role: storage-volume-initialise
    storage_volume_initialise_device: /dev/vdb
    storage_volume_initialise_mount: /var/lib/pgsql

- hosts: omero-hosts
  roles:
  - role: storage-volume-initialise
    storage_volume_initialise_device: /dev/vdb1
    storage_volume_initialise_mount: /data

- hosts: proxy-hosts
  roles:
  - role: storage-volume-initialise
    storage_volume_initialise_device: /dev/vdb
    storage_volume_initialise_mount: /var/cache/nginx

- include: idr-ebi-nfs.yml  # was: idr-dundee-nfs.yml

- include: os-omero.yml
          

May 2016: Production tuning

  • OMERO.server: Loading screens is slow
    • Each screen is several GB
    • Bio-Formats may have to load 1000s of files
  • Clients: Multiple HTTP requests needed to load a large plate (e.g. 384 thumbnails)
  • But at least it's read-only and web-only (for now)

May 2016: Production tuning

  • OMERO.server: Bio-Formats cache
  • Nginx: Very aggressive front-end caching
  • Clients: Optimise HTTP protocol

May 2016: Production tuning

Scripts to pre-warm cache

  • All thumbnails (2.8 million)
  • All metadata for Screens, Plates and Datasets (3400)
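
A pre-warming script can be as simple as requesting every URL once so that the front-end cache is populated before users arrive. The following is a minimal sketch and not the IDR's actual script. It assumes thumbnails are served from OMERO.web's `/webgateway/render_thumbnail/<id>/` path; the base URL and the list of image IDs are hypothetical inputs.

```python
from urllib.request import urlopen


def thumbnail_urls(base_url, image_ids):
    """Yield one thumbnail URL per image ID (URL pattern is an assumption)."""
    root = base_url.rstrip("/")
    for image_id in image_ids:
        yield f"{root}/webgateway/render_thumbnail/{image_id}/"


def http_get(url):
    """Fetch a URL, discard the body, and return the HTTP status code."""
    with urlopen(url, timeout=60) as response:
        response.read()
        return response.status


def prewarm(urls, fetch=http_get):
    """Request each URL once; return the URLs that did not answer with 200."""
    failed = []
    for url in urls:
        try:
            status = fetch(url)
        except OSError:  # URLError and HTTPError are subclasses of OSError
            status = None
        if status != 200:
            failed.append(url)
    return failed
```

Returning the failed URLs makes it easy to re-run the script on just the leftovers. With millions of thumbnails, a real script would probably also issue requests in parallel.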

May 2016: Production tuning

Cached on first request

  • Images (36 million planes)
  • Image metadata (2.8 million)
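
Caching on first request is the read-through pattern: the first request for an item pays the full cost of loading it, and later requests are answered from the cache. In the IDR this job is done by the caching layers named on the earlier slides, so the Python class below only illustrates the pattern. Its name and the loader function are hypothetical.

```python
class ReadThroughCache:
    """Load a value on the first request for a key, then serve it from memory."""

    def __init__(self, loader):
        self._loader = loader  # slow function, e.g. rendering one image plane
        self._store = {}
        self.misses = 0

    def get(self, key):
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._loader(key)
        return self._store[key]


if __name__ == "__main__":
    cache = ReadThroughCache(lambda plane_id: f"rendered plane {plane_id}")
    cache.get(7)  # first request: the loader runs
    cache.get(7)  # second request: served from the cache
    print(cache.misses)
```

This suits the 36 million planes better than pre-warming, because most planes may never be requested and only the ones that are end up in the cache.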

May 2016: Production tuning

  • Web session management: Redis
  • Client speed-up: HTTP/2

May 2016: IDR demo-2

Resources

OME Ansible playbooks (under development)
https://github.com/openmicroscopy/infrastructure/
OME Docker images (under development)
https://hub.docker.com/u/openmicroscopy/
Image Data Repository
http://idr-demo.openmicroscopy.org/

What next?

Thank you

  • Prof. Jason Swedlow
  • OME team