[Remote] Datacenter Hardware Operations Technician Lead, Industrial Compute

Remote · USA Full-time New today

Note: The job is a remote job and is open to candidates in USA. OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. They are seeking a Datacenter Hardware Operations Technician Lead to serve as the senior on-site technical authority for hardware reliability and fleet health at one of OpenAI’s flagship AI campuses. The role involves driving technical triage and resolution of hardware issues, collaborating with various teams, and establishing operational standards for hardware maintenance.

Responsibilities

Serve as OpenAI’s senior on-site hardware operations lead for server, GPU, storage, and rack-level infrastructure
Drive technical triage and resolution of complex hardware failures impacting production systems
Partner with Fleet Health Engineering to investigate recurring hardware issues, identify failure patterns, and improve fleet reliability
Lead root cause analysis (RCA) efforts for critical hardware incidents and develop corrective and preventive action plans
Collaborate with Oracle operations teams and OEM vendors to coordinate repairs, replacements, upgrades, and hardware lifecycle activities
Establish and continuously improve hardware maintenance procedures, operational runbooks, and troubleshooting standards
Analyze hardware failure trends and operational metrics to identify reliability risks and improvement opportunities
Support new hardware introductions, validation activities, and production readiness reviews
Coordinate spare parts strategy and inventory planning with supply chain and operations teams
Partner with Hardware Engineering, Manufacturing, and Infrastructure teams to provide field feedback that improves future platform designs
Develop scalable operational standards and best practices that can be deployed across future Stargate campuses
Mentor technicians and partner teams on advanced troubleshooting methodologies and hardware operational excellence

Skills

8+ years of experience supporting large-scale datacenter hardware infrastructure, with experience in a senior technician, sustaining engineering, or hardware operations leadership role
Deep expertise with server platforms, GPU systems, storage infrastructure, rack integration, and datacenter hardware architecture
Strong experience diagnosing complex hardware failures and leading repair efforts in production environments
Experience conducting root cause analysis and driving long-term corrective actions
Strong understanding of hardware reliability engineering principles and fleet-health management
Proven ability to partner effectively across engineering, operations, manufacturing, and vendor organizations
Comfortable operating independently in high-priority production environments with significant operational responsibility
Excellent written and verbal communication skills with the ability to influence technical and operational decisions
Experience developing operational processes, maintenance standards, and technical documentation
Ability to travel occasionally to support new campus deployments and operational readiness activities
Experience supporting large-scale GPU clusters or AI/ML infrastructure environments
Familiarity with fleet health systems, telemetry platforms, and hardware monitoring tools
Experience with failure analysis methodologies such as FRACAS, RCCA, 5-Why, Fishbone, or FMEA
Knowledge of Linux system administration and hardware validation workflows
Experience supporting hyperscale datacenter operations or HPC environments
Familiarity with server manufacturing, rack integration, or NPI-to-sustaining transitions
Industry certifications such as CompTIA Server+, OEM hardware certifications, or equivalent experience
Experience applying Environmental Health and Safety (EHS) practices in mission-critical datacenter environments

Company Overview

OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT. It is a sub-organization of OpenAI Foundation. It was founded in 2015, and is headquartered in San Francisco, California, USA, with a workforce of 1001-5000 employees. Its website is https://www.openai.com.

Apply To This Job

Apply

[Remote] Datacenter Hardware Operations Technician Lead, Industrial Compute

Related roles

[Remote] Senior Data Engineer

[Remote] Lead AWS Serverless Full Stack Engineer

[Remote] AI Solutions Engineer

[Remote] Manager, Emerging Technologies – Digital Marketing

[Remote] Enterprise System Administrator (Salesforce Admin)

[Remote] Senior Product Manager I – Monetization & Billing

[Remote] Account Executive, Emerging Enterprise

[Remote] Database Administrator

[Remote] Senior Strategic Consultant - AI Solutions, Automation & Analytics

[Remote] Marketing Science Analyst

Experienced Data Entry / General Clerk – Administrative Support & Operations

Remote Customer Service Representative – Home‑Based E‑Commerce Support Specialist at arenaflex

Remote Customer Service & E-Commerce Support Associate – Part-Time Teen Opportunity with arenaflex (Work from Home, Remotica Platform Experience)

Editor

Insurance Sales Associate – 100% Remote, Commission Only

Mental Health Therapist - Illinois

Senior Machine Learning Engineer – Generative AI (Remote)

Manager, Contracts Management (Clinical Trials Division - Americas)

Senior Marketing Technology & Analytics Lead

Remote Opportunity - Take Back Control of Your Time