As a Senior Site Reliability Operations Engineer on the Reliability Team, you will manage production incidents and improve Roblox's incident processes while reporting to the Senior Operations Manager. You will maintain reliability service-level objectives, drive incidents tenaciously to resolution, and work with service teams towards appropriate action items during the incident postmortem process. If you are passionate about maintaining uptime in a complex distributed environment full of continuous change, you'll be right at home with our Reliability team. You will report to the Senior Manager, Reliability Response.

You Will:
- Lead and manage production incidents.
- Collaborate cross-functionally to troubleshoot and resolve sophisticated technical challenges.
- Guide the implementation of incident management processes and procedures, ensuring fast and effective responses to minimize impact.
- Continually monitor system health, performance, and capacity, proactively addressing potential issues.
- Conduct comprehensive postmortem analysis to ascertain the root cause of incidents and formulate corrective measures.
- Contribute substantially to the design and enhancement of system architecture to boost reliability and performance.
- Leverage coding skills to automate daily routine tasks and enhance system efficiency.
- Serve in the Incident Manager On-Call rotation.
- Mentor junior team members.

You Have:
- At least 8 years of experience in a comparable role within a Site Reliability Team.
- Advanced knowledge of systems and network infrastructure protocols.
- Demonstrated ability in managing, troubleshooting, and resolving incidents in distributed environments.
- Experience solving problems.
- An ability to distill complex technical issues into clear and concise language.
- Familiarity with at least one scripting or programming language to automate routine tasks (Python, Golang, or similar languages preferred).
- Bachelor's degree or equivalent experience in Computer Science, Computer Engineering, or a similar technical field.

You Are:
- A great communicator; you are able to explain complex systems clearly to stakeholders and fellow engineers.
- Able to operate in potentially ambiguous circumstances during a production incident.
- Familiar with the interactions of services in a distributed system.
- Tenacious towards driving challenging production incidents to resolution.

Required Experience: Staff IC

Key Skills: Kubernetes, FMEA, Continuous Improvement, Elasticsearch, Go, Root Cause Analysis, Maximo, CMMS, Maintenance, Mechanical Engineering, Manufacturing, Troubleshooting

Employment Type: Full Time
Experience: years
Vacancy: 1

Principal Site Reliability Operations Engineer at Roblox
