Reliability Manager Interview Questions

3,999 reliability manager interview questions shared by candidates

1. Software Engineering (pick whatever language you are most comfortable in)
   a. Utilize the Blizzard Hearthstone API (see: Getting started and API guides) to retrieve card data.
      i. For the purposes of this exercise, a proper secret management mechanism is not required. Use of an industry-standard solution (Vault, Parameter Store, etc.) can be implied in your documentation.
      ii. This does not need to be a running service hosted somewhere. We will review your submission and discuss your strategies.
   b. Create a web application to render the requested information from the API into a human-readable page.
      i. Retrieve details of any 10 cards with the following criteria:
         1. Class: Druid OR Warlock
         2. Mana: at least 7
         3. Rarity: Legendary
      ii. Display the results, sorted by card ID, in a human-readable table that includes:
         · Card image
         · Name
         · Type
         · Rarity
         · Set
         · Class
   c. Provide a link to the repository for the application source and documentation.
2. Reliability Engineering
   a. Examine the UML sequence diagram of a proposed new service.
   b. Describe the following:
      i. What is the critical user path?
      ii. Propose graceful degradation of individual components.
      iii. Propose any relevant SLAs, SLIs, and SLOs for the application.
         1. Describe how/where/what you would measure, and propose initial SLO/SLA values and their justification for a business-critical service.
         2. Describe what metric types or events you use in your SLIs and include any filtering criteria.
   c. The service is now live, but your monitoring of one of the critical software components indicates the error budget burn rate is excessive. You approach the development team and determine they have an aggressive deployment cadence and a number of pushes have had issues. What do you suggest and discuss with your partner team?
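To make the burn-rate part of the question concrete, here is a minimal sketch of the arithmetic behind an error budget. The 99.9% SLO, the 30-day window, and the request counts are assumptions chosen for illustration; none of them come from the exercise, and the function names are hypothetical.

```python
def error_budget(slo: float) -> float:
    """Fraction of requests that may fail while still meeting the SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1.0 - slo


def burn_rate(failed: int, total: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget lasts exactly one SLO window;
    anything higher spends it faster.
    """
    if total <= 0:
        return 0.0
    return (failed / total) / error_budget(slo)


def hours_to_exhaustion(rate: float, window_hours: float) -> float:
    """Hours until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_hours / rate


if __name__ == "__main__":
    # Assumed numbers: 99.9% SLO over 30 days, 50 failures in 10,000 requests.
    rate = burn_rate(failed=50, total=10_000, slo=0.999)
    print(f"burn rate: {rate:.1f}x")
    print(f"budget gone in about {hours_to_exhaustion(rate, 30 * 24):.0f} hours")
```

A figure like this can anchor the conversation with the partner team: if most of the burn lines up with deployments, the discussion can turn to smaller pushes, canaries, or a release freeze while the budget recovers.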
UML Diagram code
@startuml
title "Login flow"
actor "Client" as A
boundary "Edge service" as B
participant "Auth" as C
participant "Entitlement" as D
participant "Account" as E
participant "Game Data" as F
participant "Recommendations" as G
participant "Friends" as H
participant "Presence" as I
Note over A: Customer device
Note over B: Request broker
A->>B: Custom RPC\npersistent connection
B->>C: Verify credentials\nand request token
C->>B: Confirms credentials\nand supplies token
Note over B: User session created
B->>A: Notify login success
A->>B: Request game titles\nfor UI display
B->>D: REST API request\ntitles for user_id
D->>E: REST API request\nfor titles owned by user_id
E-->>D:
D->>F: REST API request for\nmeta-data and CDN asset\nlinks for each title
F-->>D:
D->>G: REST API request for\nrecommendations using\nlist of owned titles
Note over G: Generates custom recommendations
G-->>D:
D->>B: JSON payload containing\nowned and recommended titles
B->>A: JSON payload passed\nthrough to client
Note over A: Displays lists of owned\nand recommended\ntitles and any promotions
A->>B: Request for friends data
B->>H: REST request for friends data
Note over H: Queries DB for list\nof friends and online status
H->>I: REST request for\ncurrent activity\nof each friend
I-->>H:
H->>B: JSON payload containing friends and their status
B->>A: JSON payload passed\nthrough to client
Note over A: Payload parsed and\nfriends UI element populated
@enduml
3. Systems Internals
   a. Scenario: Your partner development team comes to you for help with some performance-related issues. They say that, according to their calculations and local testing, they should be able to achieve a much higher request rate before they reach their thread limits. However, in their staging environment, they notice the average request takes significantly longer than it should to process, so the number of in-flight requests quickly climbs and exhausts system resources.
      The developer says, when they test locally, the service responds consistently with single-digit latency.
   b. Utilize the following data sources to determine why this is happening and make any necessary recommendations.
      i. Service diagram from the development team
         UML Service diagram
         @startuml
         title "Translatatron"
         actor "Load tester" as A
         boundary "Application server" as B
         participant "Redis" as C
         A->>B: HTTP
         Note over B: Application translates\nEnglish to Orcish
         B->>C: Metadata for real-time\napplication analytics
         B->>A: Response
         @enduml
         Attachment: th3.zip
      ii. Application source code (th3-server.py)
      iii. PCAP from server application instance (th3-server.pcap)
      iv. HTTP request logs from the server application
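The link between per-request latency and thread exhaustion in the scenario follows from Little's Law (in-flight requests = arrival rate × time in system). The sketch below only illustrates that relationship. The thread limit and both latency figures are assumed values, not data from th3-server.py or the PCAP, and it says nothing about what actually causes the extra latency.

```python
def in_flight_requests(rate_rps: float, latency_s: float) -> float:
    """Little's Law: average concurrency = arrival rate * time in system."""
    return rate_rps * latency_s


def max_request_rate(thread_limit: int, latency_s: float) -> float:
    """Highest sustainable rate when each in-flight request holds one thread."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return thread_limit / latency_s


if __name__ == "__main__":
    THREAD_LIMIT = 200          # assumed thread pool size
    LOCAL_LATENCY_S = 0.005     # assumed latency seen in local testing
    STAGING_LATENCY_S = 0.045   # assumed latency seen in staging

    for label, latency in (("local", LOCAL_LATENCY_S), ("staging", STAGING_LATENCY_S)):
        ceiling = max_request_rate(THREAD_LIMIT, latency)
        print(f"{label}: about {ceiling:,.0f} req/s before threads run out")
```

Under these assumptions the same thread pool supports roughly nine times fewer requests per second in staging, which is why the investigation usually starts with where the extra per-request time is spent.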

Site Reliability Engineer

Interviewed at Blizzard Entertainment

3.6
Apr 20, 2021



Glassdoor has 3,999 interview questions and reports from reliability manager interviews.