Nikola Dipanov, Facebook
Monitoring, a.k.a. figuring out what production code is doing, is extremely important for an SRE organization. Monitoring services the right way can have a profound impact on how we do SRE. Modern software systems can be incredibly complex: code runs on thousands of machines, depends on services we don't control, and also runs on user devices. Observing the behavior of such systems means we have to change how we think about monitoring.
This talk will go over what a modern monitoring infrastructure for running software at scale looks like:
Asking the right questions—how to decide what to monitor
Types of data we want to collect and what answers they can help us find
A look at how we build services at Facebook
Collecting, storing and querying monitoring data at scale
When things go wrong—what makes for a good alarm and what makes a bad one
Putting it all together—debugging an outage using data
As an attendee, you will come out of the talk with fresh ideas about logging and monitoring. You will hear how we tackle these problems at Facebook, and why we do things the way we do.
View the full SREcon18 Asia/Australia Program at [ Link ]
Sign up to find out more about SREcon at [ Link ]