SRE Interview Questions

1,929 SRE interview questions shared by candidates

1. Monitoring & Observability
What are the key metrics you monitor to ensure service reliability? How do you prioritize them?
Can you explain the difference between monitoring, logging, and tracing, and give an example of when you’d use each?
Describe a time when you set up monitoring or alerting for a critical system. What were the challenges, and how did you address them?

2. Incident Management & Troubleshooting
What’s your approach to diagnosing and resolving a high-severity incident? Can you walk me through an example?
How do you conduct post-incident reviews to prevent recurrence, and what do you look for?
Explain how you would handle an incident where latency suddenly spikes for a critical application. What steps would you take?

3. Automation & Tooling
How do you identify opportunities for automation in daily tasks? Give an example of a repetitive task you automated.
What tools have you used for automating infrastructure deployment and configuration management?
Explain how you would approach building a self-healing system. What tools and practices would you use?

4. Scalability & Performance
How would you design a system to handle high traffic loads while maintaining low latency?
Can you explain the concept of horizontal vs. vertical scaling and when you would use each?
Describe an instance when you helped optimize a system for scalability. What methods did you employ?

5. Reliability & Availability
What are Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs), and why are they important in SRE?
How would you handle a situation where you’re nearing your error budget for the quarter?
What are some common trade-offs you consider when balancing reliability with system performance and cost?
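
The error-budget question in section 5 rests on a small piece of arithmetic, which can be sketched as follows. This is only an illustration under stated assumptions: the SLO is request-based, the function names `error_budget_remaining` and `should_freeze_releases` are hypothetical, and the 10% freeze threshold is an arbitrary example value, not a figure from any interview or company policy.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget for a request-based SLO.

    slo_target is the target success ratio, e.g. 0.999 for "99.9% of requests succeed".
    The result is 1.0 when nothing has been spent and negative when the budget is overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates over the window.
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def should_freeze_releases(remaining: float, threshold: float = 0.10) -> bool:
    """Hypothetical policy: pause risky releases once less than `threshold` of the budget is left."""
    return remaining < threshold


if __name__ == "__main__":
    # Assumed example numbers: 1,000,000 requests this quarter, 750 of them failed.
    remaining = error_budget_remaining(0.999, 1_000_000, 750)
    print(f"budget remaining: {remaining:.0%}, freeze: {should_freeze_releases(remaining)}")
```

With a 99.9% target over 1,000,000 requests, the budget is roughly 1,000 failed requests, so 750 failures leaves about a quarter of it unspent.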

SRE Engineer

Interviewed at IBM

3.9
Oct 14, 2024

No.

SRE

Interviewed at Google

4.4
Jun 7, 2013
