Adding course system troubleshooting and performance (#114)

Co-authored-by: Himanshu Chandwani <hchandwa@hchandwa-ld2.linkedin.biz>
Himanshu Chandwani
2021-08-05 20:54:50 +05:30
committed by GitHub
parent 4b1c22ec44
commit 95b5e64cfb
17 changed files with 275 additions and 8 deletions

@@ -0,0 +1,12 @@
Complex systems have many components that can go wrong: bad design and architecture, poorly managed code, poor policies around different caches, bad DB queries or architecture, improper use of resources, a bad OS version, a poorly monitored system, datacenter issues, network faults, and many more.
As an SRE, knowing the important tools and commands, best practices, profiling, benchmarking, and scaling helps you troubleshoot faster and improve the performance of the overall system.
## Further reading
Here are some posts from the LinkedIn Engineering Blog, written by LinkedIn engineers, about the firefighting they did to keep the site up 24x7x365.
- [Taming memory fragmentation in Venice with Jemalloc](https://engineering.linkedin.com/blog/2021/taming-memory-fragmentation-in-venice-with-jemalloc)
- [Intro: Every Day Is Monday in Operations](https://www.linkedin.com/pulse/introduction-every-day-monday-operations-benjamin-purgason)
- [Fixing Linux filesystem performance regressions](https://engineering.linkedin.com/blog/2020/fixing-linux-filesystem-performance-regressions)
- [The impact of slow NFS on data systems](https://engineering.linkedin.com/blog/2020/the-impact-of-slow-nfs-on-data-systems)