Open role

Site Reliability Engineer (SRE)

San Francisco, CA (On-site) · Full-time

As a Site Reliability Engineer, you will focus on the stability and efficiency of our platform in production. You’ll apply software engineering principles to ensure that our systems are robust, automated, and performant. Your mission is to make our infrastructure self-sustaining and resilient, so that our customers can rely on Methodic’s service at all times. That means maintaining near-zero downtime, fast performance, and rapid recovery when incidents occur.

Responsibilities

Develop and maintain advanced monitoring, alerting, and self-healing mechanisms that detect and address issues before they impact customers.
Perform regular capacity planning and load testing so the platform scales ahead of demand without performance degradation, including validating that it sustains very high transaction rates per tenant.
Improve deployment processes with strategies like blue-green or canary deployments to minimize risk and downtime during releases.
Collaborate with software engineers to design resilient architectures – for example, build redundancy and failover capabilities into critical services (so that even if one component fails, the system remains operational).
Participate in on-call rotations to respond to and resolve production incidents; lead blameless post-mortems to identify root causes and implement corrective actions.
Automate routine operational tasks (from simple scripts to more complex tooling) to reduce manual work and error potential.
Document reliability-related procedures and best practices, ensuring knowledge is shared and systems are well understood by the team.

Requirements

5+ years of experience in a Site Reliability Engineering or similar role.
Strong coding/scripting abilities (Python, Go, or a comparable language) for building automation and tooling.
Deep knowledge of systems monitoring and observability: experience with tools like Grafana, Datadog, and Prometheus, and the ability to interpret system metrics to spot problems.
Understanding of high-availability design and distributed systems principles (load balancing, consensus, graceful degradation, etc.).
Experience with incident management and a track record of improving systems based on lessons learned.
Performance tuning experience, including the ability to use profiling and stress testing tools to find and fix bottlenecks.
A collaborative mindset, capable of working with development teams to ensure reliability is built in from the start, not just after issues occur.

Submit your application

Provide a few details and our hiring team will reach out with next steps.
