[Remote] DevOps Engineer - Atlanta, GA, Birmingham, AL, Louisville, KY, Richmond, VA, Charlotte, NC

Remote · USA · Full-time

Note: This is a remote position open to candidates in the USA. Dice is seeking an experienced Site Reliability Engineer (SRE) / DevOps Engineer with expertise in incident management and cloud-native platforms. The role involves ensuring the reliability and performance of distributed systems, managing incident response, and implementing automation and governance strategies.

Responsibilities

  • Manage and improve platform reliability, availability, and performance across production environments
  • Lead and participate in incident management, root cause analysis, remediation planning, and post-incident reviews
  • Drive change control processes and ensure operational governance standards are followed
  • Monitor and manage error budgets while implementing reliability improvements
  • Design, build, and maintain scalable cloud infrastructure and automation frameworks
  • Deploy and manage containerized applications using Kubernetes and Docker
  • Develop and maintain CI/CD pipelines to support efficient software delivery
  • Implement Infrastructure as Code (IaC) solutions for automated provisioning and configuration management
  • Establish observability strategies using monitoring, logging, and alerting platforms
  • Collaborate with development, infrastructure, security, and business teams to ensure platform stability
  • Troubleshoot complex production issues across cloud, networking, infrastructure, and application layers
  • Continuously improve operational processes, automation, and system resilience

Skills

  • 7+ years of experience in Site Reliability Engineering (SRE), DevOps, Cloud Infrastructure, or Production Operations
  • Strong experience managing workloads in cloud environments: Microsoft Azure, Amazon Web Services (AWS), Google Cloud Platform (GCP)
  • Hands-on experience with: Kubernetes, Docker, CI/CD Pipelines, Infrastructure as Code (IaC)
  • Strong scripting and automation expertise using: Python, Bash, PowerShell, Go (Golang)
  • Experience with observability and monitoring platforms: Datadog, Grafana, Prometheus, Splunk
  • Strong understanding of: Networking concepts, Linux Administration, Windows Administration, Distributed Systems, Cloud-Native Architectures
  • Experience with: Incident Response, Production Troubleshooting, Operational Governance
  • Experience implementing reliability engineering best practices and SRE methodologies
  • Experience supporting large-scale enterprise production environments
  • Familiarity with high-availability and disaster recovery architectures
  • Experience automating operational workflows and infrastructure management
  • Knowledge of security best practices within cloud environments
  • Experience working in Agile and DevOps-driven organizations

Company Overview

  • Dice is a job search platform for technology professionals and a sub-organization of DHI Group. It was founded in 1990 and is headquartered in Santa Clara, California, USA, with a workforce of 201-500 employees. Its website is http://www.dice.com.