A Dashboard Is Worth a Thousand Words: Better Monitoring for Better Ops
Luca Magnoni, CERN
Not everyone is doing SRE. A large-scale scientific organisation with decades of experience in distributed systems and IT service operations may have a solid, well-established ops culture and still benefit from adopting some of the new concepts and practices that SRE has defined in recent years. This is the story of how the creation of a new monitoring system, which gathers metrics and logs for infrastructure and services and is based on a well-known technology stack (e.g. Kafka, Grafana, InfluxDB, Elasticsearch), led not only to better service operations but also to greater awareness of SRE practices and culture among service managers. The talk will discuss the design decisions, the operational challenges in building and scaling the system up to tens of thousands of hosts, and the strategy adopted to enhance monitoring practices by introducing concepts such as SLI/SLO, along with the benefits derived.
View the full SREcon19 Asia/Pacific program at [ link ].
Sign up to find out more about SREcon at [ link ].