[Remote] Senior Manager, Engineering - Observability Platform (Remote Eligible)
Note: This is a remote position open to candidates in the USA. Smartsheet is a company that empowers teams to manage work seamlessly and scale solutions smarter. They are seeking a Senior Manager of Engineering for the Observability Platform team to lead the development of a centralized platform that provides full-stack visibility into complex systems across the company. This role involves engineering strategy and execution, focusing on observability tooling and AI integrations to enhance platform reliability.
Responsibilities
- Lead a team of engineers focused on observability platform engineering, driving the build-out of a unified observability stack used by all engineering teams at Smartsheet
- Own and evolve the platform's technical roadmap, consolidating multiple tooling platforms and AI observability tooling into a coherent, scalable capability
- Define platform standards, contribute to architectural direction, and ensure the team operates with engineering rigor and strong operational habits
- Build and scale the team, hiring senior engineers and establishing effective global practices across distributed stakeholders
- Lead design and delivery of centralized observability infrastructure covering metrics pipelines, distributed tracing, alerting frameworks, and log analytics across Smartsheet services
- Drive SLO/SLA definition and tooling for platform-wide reliability visibility, partnering closely with infrastructure, platform engineering, and on-call teams
- Own governance including instrumentation standards, cost optimization, and rollout of advanced capabilities such as APM, RUM, and custom dashboards
- Lead architecture, scaling, and operational practices for log analytics across high-throughput production workloads
- Establish shared observability libraries, agents, and SDKs that reduce instrumentation burden for application engineering teams
- Build and maintain AI/ML observability integrations in partnership with the AI Platform team
- Partner with the Data & AI Platform team to integrate MLflow tracing, Inference Tables, and LLM-as-judge evaluation pipelines into the observability stack
- Develop dashboards and alerting for agentic AI workloads, including latency, token consumption, error rates, and evaluation metric drift
- Contribute to the AI governance and cost observability program, providing telemetry for model usage, cost attribution, and compliance reporting
- Serve as the primary engineering partner for platform consumers across Data & AI, Commerce, Infrastructure, and Security teams, ensuring observability needs are met across workstreams
- Lead complex, cross-functional observability projects with high ambiguity, managing delivery risk, communicating clearly to senior stakeholders, and building alignment across teams
- Work with delivery partners to coordinate instrumentation across platform modernization and migration workstreams
- Contribute to quarterly and annual platform goals, reporting on key reliability and observability metrics to engineering leadership
- Communicate platform status, risks, and roadmap progress to audiences at the Engineering leadership level and above in a clear, executive-ready format
- Embed on-call culture and incident management discipline into the team, ensuring clear runbooks, fast MTTR, and post-incident learning loops
- Drive cost governance for observability tooling, including spend optimization and efficient resource management
- Champion AI-assisted engineering practices within the team, applying tooling and automation to reduce toil and accelerate delivery
Skills
- 10+ years of software or platform engineering experience, with strong fundamentals in distributed systems, infrastructure, and backend services
- 3 years of engineering management experience, including direct team building, performance management, and cross-functional delivery ownership
- Deep hands-on expertise with observability tooling: Datadog (APM, metrics, logs, alerting), OpenSearch or Elasticsearch, distributed tracing (OpenTelemetry or equivalent), and SLO/SLA management at scale
- Proven experience operating observability platforms for high-availability, high-throughput production environments
- Experience building and scaling engineering teams in distributed or international settings
- Strong execution track record on complex, cross-functional infrastructure programs with high ambiguity
- Clear, direct communication (written and verbal) with both technical and non-technical audiences, including leadership and executive stakeholders
- Proactive risk identification and status communication without prompting
- Experience managing vendors, external delivery partners, and third-party integrations in a platform context
- Hands-on experience with AI/ML observability: MLflow tracing, LLM evaluation pipelines, or observability for agentic AI systems
- Familiarity with Amazon Bedrock, ECS Fargate, or LangGraph-based multi-agent architectures
- Experience with cloud cost governance and FinOps practices for observability tooling
- Exposure to data platform observability and data quality monitoring in a lakehouse context
- Experience establishing internal developer platforms, shared libraries, or platform-as-a-service offerings for application teams
- Prior work in SaaS environments with enterprise compliance requirements (SOC 2, FedRAMP, HIPAA)
Benefits
- Employer-subsidized medical/vision and dental coverage for full-time employees
- 401k Match to help you save for your future (50% of your contribution up to the first 6% of your eligible pay)
- Monthly stipend to support your work and productivity
- Flexible Time Away Program, plus Sick Time Off
- US employees are automatically covered under Smartsheet-sponsored life insurance and short-term and long-term disability plans
- US employees receive 12 paid holidays per year
- Up to 24 weeks of Parental Leave
- Personal paid Volunteer Day to support our community
- Opportunities for professional growth and development including access to Udemy online courses
- Company Funded Perks, including a counseling membership, local retail discounts, and your own personal Smartsheet account
- Teleworking options from any registered location in the U.S. (role specific)
Company Overview