# Conclusion

A robust monitoring and alerting system is necessary for maintaining and
troubleshooting a system. A dashboard with key metrics gives you an
overview of service performance, all in one place. Well-defined alerts
(with realistic thresholds and notifications) further enable you to
quickly identify anomalies in the service infrastructure and in
resource saturation. By acting on them promptly, you can avoid service
degradations and decrease the mean time to detect (MTTD) service
breakdowns.
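
As a rough illustration of the MTTD figure mentioned above, the following sketch computes mean time to detect as the average gap between when an incident began and when it was detected. The `Incident` class, its field names, and the sample timestamps are hypothetical and made up for this example; they do not come from any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions for this sketch."""
    started_at: datetime   # when the fault actually began
    detected_at: datetime  # when an alert fired or someone noticed


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """Average of (detected_at - started_at) across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTD")
    total = sum((i.detected_at - i.started_at for i in incidents), timedelta())
    return total / len(incidents)


# Made-up sample data: detection delays of 4 and 10 minutes.
incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 4)),
    Incident(datetime(2024, 2, 9, 22, 30), datetime(2024, 2, 9, 22, 40)),
]
mttd = mean_time_to_detect(incidents)
print(mttd)  # 0:07:00
```

Tracking a number like this over time is one way to check whether new alerts or adjusted thresholds are actually shortening detection.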

In addition to in-house monitoring, monitoring real-user experience can
help you understand service performance as perceived by the users.
Many modules are involved in serving the user, and most of them are out
of your control. Therefore, you need to have real-user monitoring in
place.

Metrics give only an aggregate, high-level view of service performance.
To get a better understanding of the system and for faster recovery
during incidents, you might want to implement the other two pillars of
observability: logs and tracing. Logs and trace data can help you
understand what led to a service failure or degradation.

Following are some resources to learn more about monitoring and
observability:

- [Google SRE book: Monitoring Distributed Systems](https://sre.google/sre-book/monitoring-distributed-systems/)
- [Mastering Distributed Tracing by Yuri Shkuro](https://learning.oreilly.com/library/view/mastering-distributed-tracing/9781788628464/)

## References

- [Google SRE book: Monitoring Distributed Systems](https://sre.google/sre-book/monitoring-distributed-systems/)
- [Mastering Distributed Tracing, by Yuri Shkuro](https://learning.oreilly.com/library/view/mastering-distributed-tracing/9781788628464/)
- [Monitoring and Observability](https://copyconstruct.medium.com/monitoring-and-observability-8417d1952e1c)
- [Three Pillars with Zero Answers](https://medium.com/lightstephq/three-pillars-with-zero-answers-2a98b36358b8)
- Engineering blogs on [LinkedIn](https://engineering.linkedin.com/blog/topic/monitoring), [Grafana](https://grafana.com/blog/), [Elastic.co](https://www.elastic.co/blog/), and [OpenTelemetry](https://medium.com/opentelemetry)