How to build and run an international open-data image repository

Simon Li
Open Microscopy Environment
University of Dundee

### Abstract

Image Data Resource (IDR, https://idr.openmicroscopy.org) is a public data repository containing over 100 TB of life sciences imaging data from published studies in a searchable and reusable format. It is built from existing open-source tools, but significant work was required to deploy it and keep it running as a production service. I will talk about the journey from our first ever use of cloud services to scaling up a single-server system into the reliable public resource that exists today, including design choices, the mistakes we made, and the challenges we still face. I will introduce some of the tools we use, including Ansible, OpenStack, Docker and Kubernetes, but with a focus on the benefits of reproducible deployments rather than going into too much technical detail of particular tools. All of our infrastructure is open source and I will explain why, and provide links for people interested in finding out more. This talk will hopefully provide an insight into how a public data service like this could be set up by your institution, including many of the considerations you may not have thought of.

Overview

Who, What, Why
Current deployment
How did we get here?
Where are we now?

Who am I?

Software engineer and sysadmin at the OME since 2012

Open Microscopy Environment

A consortium of universities, research labs, industry and developers producing open-source software and standards for microscopy data.

Why do we need the IDR?

Open-science

Open-data
Open-access
Open infrastructure
Don't build new databases from scratch, instead re-use and build upon existing work.

Data should be

Findable

Accessible

Interoperable

Reusable

The FAIR principles: a set of guidelines for publishing scientific data, endorsed by G20 leaders
But this is just the minimum

Make it easy for others to use your data

Imaging data is complicated

No it's not

Yes it is!

100s of proprietary file formats in the life sciences
Many different imaging modalities and scales

Single molecule

Tissue

351 Gigapixels

Timelapse

3D lightsheet

27 channels

96 well plate

384 well plate

➕ Custom metadata such as experimental information

A public repository for reference datasets and images of interest to a broader community of users

Provides the expertise needed to curate and publish life-science imaging data effectively

From this ▶▶▶

▶▶▶ idr.openmicroscopy.org

IDR in numbers (August 2019)

5,341,278 images

19,076,141 files

125 TB

Behind the scenes

OpenStack private cloud
EMBL-EBI

Main components

OMERO: Open-source enterprise platform for managing imaging data in the Life Sciences

Bio-Formats: used by OMERO to read and write over 150 image formats including metadata

PostgreSQL: used by OMERO.server to store all metadata and file information

Nginx: Load-balancing proxy and cache

JupyterHub: Co-located platform for analysing data in the IDR

Storage: Mix of cloud (read-write) and NFS (read-only)

Data submission and curation

Average per submission

91,000 images
323,000 files
2.1 TB
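
To give a feel for what these averages imply, the arithmetic below derives files per image and approximate sizes. It is only a back-of-envelope sketch: it assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), which the slide does not specify, and the `ratio` helper is a hypothetical name.

```shell
#!/bin/sh
# Sketch: back-of-envelope figures from the per-submission averages.
# Assumes decimal units (1 TB = 10^12 bytes); the slide does not say which.

# Prints numerator / denominator rounded to one decimal place.
ratio() {
    LC_ALL=C awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f\n", n / d }'
}

IMAGES=91000
FILES=323000
BYTES=2100000000000   # 2.1 TB

echo "files per image: $(ratio "$FILES" "$IMAGES")"
echo "MB per file:     $(ratio "$BYTES" "$((FILES * 1000000))")"
echo "MB per image:    $(ratio "$BYTES" "$((IMAGES * 1000000))")"
```

Under these assumptions a typical submission has roughly 3.5 files per image, about 6.5 MB per file and about 23 MB per image.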

How did we get here?

Early 2015: Work started
October 2015: Demo 1
May 2016: Demo 2
April 2017: Official launch

What we started with

OMERO: used by 100s of institutions around the world

Current IDR: 125 TB

↳ At the time, the University of Dundee server held around 12 TB of data

Everything done manually


                  $ ssh idr.server

                  # yum install java-1.8.0-openjdk
                  # yum install python-{pip,devel,virtualenv,yaml,jinja2,tables}
                  # ...

What was new?

New hardware: Servers and IBM Spectrum (GPFS) storage array
New way of working

Infrastructure as code

Apply the software development process to managing servers
Clear separation between data and applications
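
As a rough illustration of the idea, the manual commands shown earlier could be captured in a small script that lives in version control and can be re-run safely. This is only a sketch and not the IDR's actual tooling: the package list is abbreviated from the manual example, and the `PACKAGES` and `DRY_RUN` variables and the `ensure_packages`, `is_installed` and `install_pkg` functions are hypothetical names.

```shell
#!/bin/sh
# Sketch: declare the desired packages once, then converge towards that state.
# All names here (PACKAGES, DRY_RUN, ensure_packages, ...) are hypothetical.

# Desired state: kept in version control and reviewed like any other code.
PACKAGES="java-1.8.0-openjdk python-pip python-devel python-virtualenv"

# Succeeds if the named RPM package is already installed.
is_installed() {
    rpm -q "$1" >/dev/null 2>&1
}

# Installs one package; by default it only prints what it would do.
install_pkg() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: yum install -y $1"
    else
        yum install -y "$1"
    fi
}

# Idempotent: packages that are already present are left untouched,
# so re-running this on a configured server changes nothing.
ensure_packages() {
    for pkg in $PACKAGES; do
        if is_installed "$pkg"; then
            echo "ok: $pkg"
        else
            echo "missing: $pkg"
            install_pkg "$pkg" || return 1
        fi
    done
}
```

Calling `ensure_packages` prints the plan without changing anything; setting `DRY_RUN=0` and running it as root would apply it. The point of the design is that the same reviewed file describes every server, instead of each server depending on what someone typed over SSH.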

Demo 1, October 2015

It works... for a bit

Is it a hardware issue with new servers?
Is Docker too new and unreliable?
Bug in the software?
Something else?

First lesson: Verify your infrastructure at every step

Make sure all layers of your stack are reliable

Don't try too many new things at once
Virtualisation is convenient and powerful but you need to trust all layers
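
One way to act on this is to run a small set of checks from the lowest layer upwards and stop at the first failure, so that an error points at a specific layer rather than at the whole stack. The following is a minimal sketch under that assumption; the individual checks and all function names are hypothetical placeholders, not the checks used for the IDR.

```shell
#!/bin/sh
# Sketch: verify each layer in order, lowest first; stop at the first failure.
# The checks and their names are hypothetical examples.

# Storage layer: write a file, read it back, and compare.
check_storage() {
    tmp=$(mktemp) || return 1
    echo "layer-check" > "$tmp"
    content=$(cat "$tmp")
    rm -f "$tmp"
    [ "$content" = "layer-check" ]
}

# Name resolution (Linux, uses getent): localhost should resolve.
check_dns() {
    getent hosts localhost >/dev/null 2>&1
}

# Runs the named check functions in the order given.
# Prints PASS/FAIL per check and returns non-zero at the first failure.
run_checks() {
    for check in "$@"; do
        if "$check"; then
            echo "PASS $check"
        else
            echo "FAIL $check"
            return 1
        fi
    done
}
```

A call such as `run_checks check_storage check_dns` would be extended with checks for the higher layers (container runtime, database, application) as each one is introduced, which also keeps the number of new, unverified pieces small at any one time.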

Demo 2, May 2016

OpenStack at EMBL-EBI with Ansible:

Relatively easy to set up
Just requires SSH access to servers
Everything configured with YAML

One command can reproducibly provision new servers, configure networking and storage, and install the IDR (see the IDR/deployment repository).

It works... for a bit

We have a scaling problem: OMERO just wasn't designed for this amount of data and frequency of access. These problems only occur on a big system like the IDR.