How to build and run an international open-data image repository

Simon Li
Open Microscopy Environment
University of Dundee


  • Who, What, Why
  • Current deployment
  • How did we get here?
  • Where are we now?

Who am I?

Software engineer and sysadmin at the OME since 2012

Open Microscopy Environment

A consortium of universities, research labs, industry and developers producing open-source software and standards for microscopy data.

Why do we need the IDR?


  • Open-data
  • Open-access
  • Open infrastructure

    Don't build new databases from scratch; instead, reuse and build upon existing work.

Data should be FAIR

  • Findable
  • Accessible
  • Interoperable
  • Reusable

Make it easy for others to use your data

Imaging data is complicated

No it's not

Yes it is!

  • 100s of proprietary file formats in the life sciences
  • Many different imaging modalities and scales: single molecule, 351 gigapixels, 3D lightsheet, 27 channels, 96-well plate, 384-well plate
  • Plus custom metadata such as experimental information

A public repository for reference datasets and images of interest to a broader community of users

Provides the expertise needed to curate and publish life-science imaging data effectively

From this ▶▶▶


IDR in numbers (August 2019)

5,341,278 images

19,076,141 files

125 TB

Behind the scenes

OpenStack private cloud

Main components

  • OMERO: open-source enterprise platform for managing imaging data in the life sciences
  • Bio-Formats: used by OMERO to read and write over 150 image formats, including metadata
  • PostgreSQL: used by OMERO.server to store all metadata and file information
  • Nginx: load-balancing proxy and cache
  • JupyterHub: co-located platform for analysing data in the IDR
  • Storage: mix of cloud (read-write) and NFS (read-only)

Data submission and curation

Average per submission

  • 91,000 images
  • 323,000 files
  • 2.1 TB
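
As a rough consistency check, these averages can be divided into the August 2019 totals to estimate how many submissions they imply. This is only a back-of-the-envelope sketch: the variable names are illustrative, and the resulting submission count is an inference from the rounded figures, not a number reported by the IDR.

```python
# Totals reported for the IDR in August 2019.
totals = {"images": 5_341_278, "files": 19_076_141, "terabytes": 125.0}

# Rounded averages per submission.
per_submission = {"images": 91_000, "files": 323_000, "terabytes": 2.1}

# Each ratio is an independent estimate of the number of submissions.
implied = {key: totals[key] / per_submission[key] for key in totals}

for key, value in implied.items():
    print(f"{key}: about {value:.1f} submissions")
```

All three ratios land between 58 and 60, so the averages and the totals agree with each other.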

How did we get here?

  • Early 2015: Work started
  • October 2015: Demo 1
  • May 2016: Demo 2
  • April 2017: Official launch

What we started with

Used by 100s of institutions around the world

Current IDR: 125 TB

At the time, the University of Dundee server held around 12 TB of data

Everything done manually

    $ ssh idr.server

    # yum install java-1.8.0-openjdk
    # yum install python-{pip,devel,virtualenv,yaml,jinja2,tables}
    # ...

What was new?

  • New hardware: Servers and IBM Spectrum (GPFS) storage array
  • New way of working

Infrastructure as code

  • Apply the software development process to managing servers
  • Clear separation between data and applications
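
The idea can be sketched in a few lines: the desired state of a server lives in a version-controlled data structure, and a tool works out what has to change. The sketch below is illustrative only. The package names are taken from the manual yum commands; `DESIRED_PACKAGES` and `plan_install` are hypothetical names, and this is not how the IDR deployment is actually implemented.

```python
# Desired state, kept in version control (hypothetical example).
DESIRED_PACKAGES = [
    "java-1.8.0-openjdk",
    "python-pip",
    "python-devel",
    "python-virtualenv",
    "python-yaml",
    "python-jinja2",
    "python-tables",
]


def plan_install(desired, installed):
    """Return the yum command needed to reach the desired state, or None.

    Running the plan twice is safe: once everything is installed,
    there is nothing left to do.
    """
    missing = [pkg for pkg in desired if pkg not in installed]
    if not missing:
        return None
    return ["yum", "install", "-y"] + missing


# A fresh server with only Java installed (assumed starting state).
installed = {"java-1.8.0-openjdk"}
print(plan_install(DESIRED_PACKAGES, installed))

# After the plan has been applied, nothing remains to do.
installed.update(DESIRED_PACKAGES)
print(plan_install(DESIRED_PACKAGES, installed))
```

Because the plan is computed from the difference between desired and actual state, the same description can be applied to a new server or re-applied to an existing one.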

Demo 1, October 2015

It works... for a bit

  • Is it a hardware issue with new servers?
  • Is Docker too new and unreliable?
  • Bug in the software?
  • Something else?

First lesson: Verify your infrastructure at every step

Make sure all layers of your stack are reliable

  • Don't try too many new things at once
  • Virtualisation is convenient and powerful but you need to trust all layers
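
One way to apply this lesson is to check each layer in order, from the bottom of the stack upwards, and stop at the first failure so that the broken layer is obvious. The sketch below is a generic illustration: `verify_stack`, the layer names and the stand-in checks are all assumptions, not the IDR's actual tests.

```python
def verify_stack(checks):
    """Run (name, check) pairs in order; return the first failing layer.

    Each check is a zero-argument callable returning True on success.
    A check that raises is treated as a failure.
    Returns None when every layer passes.
    """
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            return name
    return None


# Stand-in checks; real ones might test disks, the network, the
# container runtime and finally the application itself.
checks = [
    ("hardware", lambda: True),
    ("storage", lambda: True),
    ("virtualisation", lambda: False),  # simulated failure
    ("application", lambda: True),
]

failed = verify_stack(checks)
print(f"first failing layer: {failed}")
```

Ordering the checks from the lowest layer up means a failure is never blamed on the application when the real cause sits underneath it.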

Demo 2, May 2016

OpenStack at EMBL-EBI with Ansible:

  • Relatively easy to set up
  • Just requires SSH access to servers
  • Everything configured with YAML

One command can provision new servers, configure networking and storage, and install the IDR, reproducibly: IDR/deployment

It works... for a bit

We have a scaling problem: OMERO just wasn't designed for this amount of data and frequency of access. These problems only occur on a big system like the IDR.

How do you debug and test a 50+ (now 100+) TB system?


Official release: April 2017

Where are we now?

Who's working on the IDR?

Sebastien Besson
Jean-Marie Burel
Mark Carroll
David Gault
Riad Gozim
Simon Li
Dominik Lindner
Melissa Linkert
Josh Moore
Will Moore
Petr Walczysko
Frances Wong

Curation: A critical factor in the success of the IDR (remember: FAIR)

  • Metadata for millions of images is manually curated
  • Constant stream of new datasets

This is one of the largest public bioimage publication systems running in the world

  • Publicly available
  • Recommended by journals as a data repository
  • Whole stack is available on GitHub
  • 40,000 visitors in the past 12 months

IDR/deployment IDR/idr-log-analysis

Jason Swedlow
Sebastien Besson
Jean-Marie Burel
Mark Carroll
David Gault
Riad Gozim
Simon Li
Dominik Lindner
Melissa Linkert
June Matthew
Josh Moore
Will Moore
Petr Walczysko
Frances Wong
Rafael Carazo-Salas
Alvis Brazma
Ugis Sarkans
Simon Jupp
Tony Burdett
Aleksandra Tarkowska
Anatole Chessel
Richard Ferguson
Helen Flynn
Kenny Gillen
Roger Leigh
Simone Leo
Gabriella Rustici
Eleanor Williams