[Remote] Senior Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. Synthesia is the world’s leading AI video platform for business, used by over 90% of the Fortune 100. They are seeking a dedicated Senior Site Reliability Engineer to take ownership of operational excellence across Cloud Infrastructure, focusing on incident management, automation, and vendor relationships.
Responsibilities
- Incident management & operational excellence — take custody of the incident process: on-call quality, response, post-mortems, and driving down incident count, time-to-detect, and time-to-resolve
- Automation & reliability engineering — automate low-frequency, high-consequence operations (the certificate-renewal class of problem — rare, easy to forget, outage-causing when missed), not just the high-frequency toil. You decide what to automate based on risk and blast radius, not just time saved
- A platform domain — over time, deep ownership of a domain such as Temporal, observability, or Kubernetes operations, partnering with the engineers building in it
- Vendor & third-party management — own key external relationships and integrations (e.g. LLM API providers, third-party services), today managed manually and informally. Bring structure, automation, and bus-factor resilience
- FinOps — own cloud and platform cost visibility and efficiency, and the mechanics of how usage maps to billing
Skills
- Strong production operations experience on AWS and Kubernetes; comfortable with MongoDB and scripting/automation in Python
- An operations-and-reliability mindset — you take pride in systems that run quietly — paired with the instinct to engineer the problem away rather than absorb it manually
- Sound judgement on incidents and risk; calm and clear under pressure
- Influences through relationships and evidence, not escalation; comfortable owning a domain and partnering across teams
- Bonus: vendor/cost management exposure, Temporal, observability tooling
Company Overview