How to build and run an international open-data image repository

Simon Li
Open Microscopy Environment
University of Dundee

Overview

  • Who, What, Why
  • Current deployment
  • How did we get here?
  • Where are we now?

Who am I?

Software engineer and sysadmin at the OME since 2012

Open Microscopy Environment

A consortium of universities, research labs, industry and developers producing open-source software and standards for microscopy data.

Why do we need the IDR?

Open science

  • Open data
  • Open access
  • Open infrastructure

    Don't build new databases from scratch; instead, reuse and build upon existing work.

Data should be

Findable

Accessible

Interoperable

Reusable

Make it easy for others to use your data

Imaging data is complicated

No it's not

Yes it is!

  • 100s of proprietary file formats in the life sciences
  • Many different imaging modalities and scales

    Examples: single molecule, tissue, 351 gigapixels, timelapse, 3D lightsheet, 27 channels, 96-well plate, 384-well plate

➕ Custom metadata such as experimental information

A public repository for reference datasets and images of interest to a broader community of users

Provides the expertise needed to curate and publish life-science imaging data effectively

From this ▶▶▶

▶▶▶ idr.openmicroscopy.org

IDR in numbers (August 2019)

5,341,278 images

19,076,141 files

125 TB

Behind the scenes

OpenStack private cloud
EMBL-EBI

Main components

OMERO: Open-source enterprise platform for managing imaging data in the life sciences
Bio-Formats: used by OMERO to read and write over 150 image formats including metadata
PostgreSQL: used by OMERO.server to store all metadata and file information
Nginx: Load-balancing proxy and caching
JupyterHub: Co-located platform for analysing data in the IDR
Storage: Mix of cloud (read-write) and NFS (read-only)

Data submission and curation

Average per submission

  • 91,000 images
  • 323,000 files
  • 2.1 TB
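
As a rough cross-check, the averages above can be compared with the August 2019 totals. The sketch below assumes both sets of figures describe the same snapshot of the IDR, which the slides do not state; the variable names are illustrative only.

```python
# Totals from "IDR in numbers (August 2019)".
totals = {"images": 5_341_278, "files": 19_076_141, "terabytes": 125.0}

# Rounded averages from "Average per submission".
per_submission = {"images": 91_000, "files": 323_000, "terabytes": 2.1}

# Assumption: both sets of figures describe the same snapshot of the IDR.
implied_submissions = {
    key: totals[key] / per_submission[key] for key in totals
}

for key, value in implied_submissions.items():
    print(f"{key}: about {value:.1f} submissions")
```

All three ratios fall between 58 and 60, so the figures are consistent with roughly 59 submissions, each of them large by the standards of a typical lab server.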

How did we get here?

Early 2015: Work started
October 2015: Demo 1
May 2016: Demo 2
April 2017: Official launch

What we started with

Used by 100s of institutions around the world

 
Current IDR: 125 TB

At the time, the University of Dundee server held around 12 TB of data

Everything done manually


    $ ssh idr.server

    # yum install java-1.8.0-openjdk
    # yum install python-{pip,devel,virtualenv,yaml,jinja2,tables}
    # ...
                

What was new?

  • New hardware: Servers and IBM Spectrum Scale (GPFS) storage array
  • New way of working

Infrastructure as code

  • Apply the software development process to managing servers
  • Clear separation between data and applications
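
To make the contrast with the manual yum session concrete, here is a minimal sketch of the idea in Python. It is not the IDR's actual tooling: the package list is expanded from the commands shown earlier, and `plan_install` is a hypothetical helper. The desired state is declared as data, and the code only works out what is missing, so running it repeatedly is safe.

```python
# Desired state, declared as data and kept under version control.
# Package names are expanded from the manual yum commands shown earlier.
DESIRED_PACKAGES = [
    "java-1.8.0-openjdk",
    "python-pip",
    "python-devel",
    "python-virtualenv",
    "python-yaml",
    "python-jinja2",
    "python-tables",
]


def plan_install(desired, installed):
    """Return the packages that still need installing, in declared order.

    Hypothetical helper: a real tool would query the package manager
    for `installed` and then perform the installation itself.
    """
    already = set(installed)
    return [name for name in desired if name not in already]


# First run on a fresh server: everything is missing.
first_run = plan_install(DESIRED_PACKAGES, installed=[])

# Second run, after the first plan was applied: nothing left to do.
second_run = plan_install(DESIRED_PACKAGES, installed=first_run)

print(first_run)
print(second_run)
```

Because the package list lives in a file rather than in someone's shell history, it can be reviewed, versioned and replayed on a new server.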

Demo 1, October 2015

It works... for a bit

  • Is it a hardware issue with new servers?
  • Is Docker too new and unreliable?
  • Bug in the software?
  • Something else?

First lesson: Verify your infrastructure at every step

Make sure all layers of your stack are reliable

  • Don't try too many new things at once
  • Virtualisation is convenient and powerful but you need to trust all layers

Demo 2, May 2016

OpenStack at EMBL-EBI with Ansible:

  • Relatively easy to set up
  • Just requires SSH access to servers
  • Everything configured with YAML

One command can provision new servers, configure networking and storage, and install the IDR, reproducibly: IDR/deployment

It works... for a bit

We have a scaling problem: OMERO just wasn't designed for this amount of data and frequency of access. These problems only occur on a big system like the IDR.

How do you debug and test a 50+ (now 100+) TB system?

IDR/deployment
Production
Staging
Test-1
Test-2
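
One way to read this slide is that the same deployment code is applied to each environment, with only a few parameters changing between them. The sketch below illustrates that pattern in Python; every setting name and value is hypothetical and is not taken from IDR/deployment.

```python
# Settings shared by every environment (hypothetical names and values).
BASE = {"install_idr": True, "storage_read_only": True, "public": False}

# Per-environment overrides; the environment names follow the slide.
OVERRIDES = {
    "production": {"public": True},
    "staging": {},
    "test-1": {"storage_read_only": False},
    "test-2": {"storage_read_only": False},
}


def settings_for(environment):
    """Merge the shared base with one environment's overrides."""
    if environment not in OVERRIDES:
        raise KeyError(f"unknown environment: {environment}")
    # Build a new dict so BASE itself is never modified.
    return {**BASE, **OVERRIDES[environment]}


for name in OVERRIDES:
    print(name, settings_for(name))
```

Keeping the differences between environments this small is what makes a problem reproduced on a test system a credible stand-in for production.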

Official release: April 2017

Where are we now?

Who's working on the IDR?

Sebastien Besson
Jean-Marie Burel
Mark Carroll
David Gault
Riad Gozim
Simon Li
Dominik Lindner
Melissa Linkert
Josh Moore
Will Moore
Petr Walczysko
Frances Wong
QA/Tester
Curator

Curation: A critical factor in the success of the IDR (remember: FAIR)

  • Metadata for millions of images is manually curated
  • Constant stream of new datasets

This is one of the largest public bioimage publication systems running in the world

  • Publicly available
  • Recommended by journals as a data repository
  • Whole stack is available on GitHub
40,000 visitors in the past 12 months

IDR/deployment IDR/idr-log-analysis

Jason Swedlow
Sebastien Besson
Jean-Marie Burel
Mark Carroll
David Gault
Riad Gozim
Simon Li
Dominik Lindner
Melissa Linkert
June Matthew
Josh Moore
Will Moore
Petr Walczysko
Frances Wong
Rafael Carazo-Salas
Alvis Brazma
Ugis Sarkans
Simon Jupp
Tony Burdett
Aleksandra Tarkowska
Anatole Chessel
Richard Ferguson
Helen Flynn
Kenny Gillen
Roger Leigh
Simone Leo
Gabriella Rustici
Eleanor Williams

Former members