OMERO Ops

OME Users meeting 2018

What, Why, How?

  • What is happening with my systems?
  • Why has it crashed?
  • How can I avoid it in future?

The solution: Monitoring

AdRem NetCrunch
Airbrake.io
Amazon Cloudwatch
Amon
Anturis
AppDynamics
Appensure
Appoptics
Appsignal.com
Bijk
Bluestripe
BMC TrueSight Pulse
Bosun
Cabotapp
Cacti
Capasystems
CA Technologies
Check_MK
Cisco Cloud Consumption Service
Collectd
Coscale
Count.ly
Datadog
Dynatrace
Eventsentry
Gear5
getplatypus.io
Glances
Grafana
Graphiteapp
Happy Apps
Host Tracker
HPE
IBM
Icinga
Idera
InfluxData (InfluxDB)
Inspectit
instana.com
Instrumental
ITRS
loadview-testing.com
LogicMonitor
Logmatic
Loom Systems
Manageengine Opmanager
Monit
Monitis
Monitorix
MRTG
Munin
My-netdata.io
Nagios
netsil.com
New Relic
Nginx Amplify
Nixstats
Nodequery
ntop
Observium
op5
OpenNMS
Oracle
Outlyer
PA Server Monitor
Pandora FMS
Panopta
Pingdom
Pingdom Server Monitor
plumbr.eu
Prometheus.io
PRTG
RHQ
Riverbed
Rollbar
RRDtool
Scoutapp Realtime
Sematext
Sensuapp.org
Sentry
Serverdensity
Sightline Systems
Solarwinds
Splunk
Sumologic.com
Supervisord
Sysdig.com
ThousandEyes
Trace by RisingStack
TraceView
Unomaly
updown.io
Uptime.com
UptimeRobot
Vityl Monitor
WhatsUpGold
Zabbix
Zenoss

We've used two systems in the OME

Check_MK

What we'll look at:

Logfiles

BASH & Cron

bin/omero sessions

Check_MK

Monitoring OMERO.web - Public User

Monitoring Other Resources

"Local Checks"

Insights from Monitoring

OMERO log files

docs.openmicroscopy.org/latest/omero/developers/logging.html

Counts give a rough idea

https://explainshell.com/explain?cmd=(^)
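
As a rough illustration of the counting idea, here is a minimal Python sketch that tallies log lines per level. It assumes each line starts with a date, a time, and then the level (for example "2018-05-01 12:00:00,000 INFO ..."); that layout and the function name count_levels are assumptions for this sketch, so check them against your own log files.

```python
import sys
from collections import Counter

# Levels we expect to see in the third whitespace-separated field (assumption).
LEVELS = ("DEBUG", "INFO", "WARN", "ERROR")


def count_levels(lines):
    """Count log lines per level; lines that do not match are ignored."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 3 and fields[2] in LEVELS:
            counts[fields[2]] += 1
    return counts


# Only runs when a log file path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as handle:
        for level, total in count_levels(handle).most_common():
            print(f"{total:8d} {level}")
```

Continuation lines such as stack traces do not start with a date and level, so they are skipped rather than counted.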

Cut and Sort can extract context

https://explainshell.com/explain?cmd=(^)
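
The same cut-and-sort idea can be sketched in Python: keep only the lines that contain a keyword, pick one whitespace-separated field, and rank the values by frequency. The field index (3, assumed to be the logger name) and the function name top_fields are hypothetical choices for this sketch.

```python
from collections import Counter


def top_fields(lines, needle="ERROR", field=3, limit=10):
    """Return the most common values of one field among matching lines."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        # split() collapses runs of spaces, unlike cut -d' '.
        fields = line.split()
        if len(fields) > field:
            counts[fields[field]] += 1
    return counts.most_common(limit)
```

Calling top_fields(open(path)) on a log file would then list which components produce the most ERROR lines, which is often enough context to decide where to look next.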

grep "today", send an email

https://gitlab.com/openmicroscopy/incubator/..

Add the script's path to the root user's crontab, scheduled for 23:55
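
A minimal Python sketch of the "grep today, send an email" step, assuming log lines begin with an ISO date (YYYY-MM-DD). The function names, addresses, and subject wording are hypothetical; the actual script lives in the GitLab repository linked above and may work differently.

```python
import datetime
from email.message import EmailMessage


def todays_errors(lines, today=None, needle="ERROR"):
    """Return lines that start with today's date and contain the keyword."""
    today = today or datetime.date.today()
    prefix = today.isoformat()
    return [
        line.rstrip("\n")
        for line in lines
        if line.startswith(prefix) and needle in line
    ]


def build_report(errors, sender, recipient, today=None):
    """Build (but do not send) an email summarising the matching lines."""
    today = today or datetime.date.today()
    message = EmailMessage()
    message["Subject"] = (
        f"OMERO log report {today.isoformat()}: {len(errors)} matching lines"
    )
    message["From"] = sender
    message["To"] = recipient
    message.set_content("\n".join(errors) or "No matching lines today.")
    return message

# Sending could use smtplib.SMTP("localhost").send_message(message),
# assuming a local mail relay is available.
```

A crontab entry of the form "55 23 * * * /path/to/script" (path hypothetical) would run such a report at 23:55 each day, late enough to cover almost all of the day's log lines.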

Who's logged in? "bin/omero sessions who" will tell you

docs.openmicroscopy.org/latest/omero/users/cli/sessions.html

Lots of systems to look after: lots to worry about.

Are they all up? Serving images/emails/websites/...?

Any experience on the team?

All the hosts we care about...

Issues across entire estate, at a glance.

One screen to view all issues.

Email alerts out of the box.

Slack alerts with a python script.
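
A sketch of what such a Slack notification script might look like. It assumes Check_MK passes notification details to the script as environment variables with a NOTIFY_ prefix (NOTIFY_HOSTNAME, NOTIFY_SERVICEDESC, and so on) and that alerts go to a Slack incoming webhook; the SLACK_WEBHOOK_URL variable and both function names are hypothetical, and this is not the script used by OME.

```python
import json
import os
import urllib.request


def build_payload(env):
    """Turn notification variables into a Slack incoming-webhook payload."""
    host = env.get("NOTIFY_HOSTNAME", "unknown host")
    service = env.get("NOTIFY_SERVICEDESC")
    if service:
        state = env.get("NOTIFY_SERVICESTATE", "UNKNOWN")
        text = f"{state}: {service} on {host}"
    else:
        state = env.get("NOTIFY_HOSTSTATE", "UNKNOWN")
        text = f"{state}: host {host}"
    return {"text": text}


def post_to_slack(webhook_url, payload):
    """POST the JSON payload to the webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Only posts when a webhook URL is configured (hypothetical variable name).
if __name__ == "__main__" and os.environ.get("SLACK_WEBHOOK_URL"):
    post_to_slack(os.environ["SLACK_WEBHOOK_URL"], build_payload(os.environ))
```

Keeping the message construction separate from the network call makes it easy to check the wording without posting to a real channel.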

Check_MK can monitor whether HTTP endpoints are up, grep the response content, and check the response size, which is useful for OMERO.web

Here are some example endpoints. I check HTTP status, response time, and response size.


HTTP webgateway/img_detail
  https://pub-omero.openmicroscopy.org/webgateway/img_detail/1/

HTTP webgateway/render_image
  https://pub-omero.openmicroscopy.org/webgateway/render_image/1/47/0/?c=1|21:1952$0000FF,2|32:1831$00FF00,3|90:4302$FF0000&m=c&p=normal&ia=0&q=0.9

HTTP webgateway/render_image_region
  https://pub-omero.openmicroscopy.org/webgateway/render_image_region/2/0/0/?tile=2,3,1,45,15

HTTP webgateway/render_thumbnail
  https://pub-omero.openmicroscopy.org/webgateway/render_thumbnail/2/
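
To make the three measurements concrete, here is a minimal Python sketch of an equivalent probe outside Check_MK: it fetches a URL and reports the HTTP status, the response time, and the response size. The thresholds and the function names probe and evaluate are hypothetical; Check_MK's own HTTP check is configured differently.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=10):
    """Fetch url and return (status, seconds, size_in_bytes).

    Connection failures (urllib.error.URLError) are left to propagate.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read()
            status = response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses still carry a status code and a body.
        body = error.read()
        status = error.code
    return status, time.monotonic() - start, len(body)


def evaluate(status, seconds, size, max_seconds=2.0, min_size=1):
    """Return a list of problems; an empty list means the endpoint looks fine."""
    problems = []
    if status != 200:
        problems.append(f"unexpected HTTP status {status}")
    if seconds > max_seconds:
        problems.append(f"slow response: {seconds:.2f}s")
    if size < min_size:
        problems.append(f"response too small: {size} bytes")
    return problems
```

For example, evaluate(*probe(url)) on the render_thumbnail endpoint would flag an empty or unusually slow thumbnail even when the server still answers with status 200.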


A public user with sample data enables monitoring of OMERO.web

docs.openmicroscopy.org/latest/omero/sysadmins/public.html

Monitor all public OMERO.web endpoints

Fix: openmicroscopy/pull/5699

.. and any other production web resources

Local Checks - omero user sessions - week

https://mathias-kettner.de/checkmk_localchecks.html
OME gitlab monitoring-scripts
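
A Check_MK local check is any executable that prints one line per service in the form "state name metrics details", where state is 0 (OK), 1 (WARN), 2 (CRIT) or 3 (UNKNOWN). The Python sketch below formats such a line for a session count. The service name, metric name, and thresholds are hypothetical, and obtaining the count (for example from "bin/omero sessions who") is left out.

```python
def local_check_line(name, value, warn, crit, metric="count"):
    """Format one Check_MK local check line for a numeric value."""
    if value >= crit:
        state, label = 2, "CRIT"
    elif value >= warn:
        state, label = 1, "WARN"
    else:
        state, label = 0, "OK"
    # metric=value;warn;crit lets Check_MK graph the value over time.
    return (
        f"{state} {name} {metric}={value};{warn};{crit} "
        f"{label} - {value} active sessions"
    )
```

Printing local_check_line("OMERO_sessions", 3, 50, 100) from a script in the agent's local checks directory would register a service whose metric can then be graphed by week or by month.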

Local Checks - omero user sessions - months

https://mathias-kettner.de/checkmk_localchecks.html
OME gitlab monitoring-scripts

Unusual Trends - CPU alerts - iViewer

Fix: openmicroscopy/pull/5550

Unusual Trends - open files

Fix: openmicroscopy/pull/5699

Why Prometheus and Grafana instead of Check_MK?

  • Automatic deployment
  • Horizontal scaling
  • It looks nice!

outreach.openmicroscopy.org/grafana

Example deployment

Discussion ...