[Remote] Principal Site Reliability Engineer - AI Infrastructure Operations

Remote · USA · Full-time

Note: This is a remote role open to candidates in the USA. Nscale is a GPU cloud company focused on providing high-performance infrastructure for AI. It is seeking a Principal Site Reliability Engineer to lead reliability strategy and operational excellence for its AI Infrastructure Operations team, ensuring the scalability and reliability of its demanding AI platforms.

Responsibilities

  • Owning and evolving the long-term reliability strategy for Nscale’s AI and HPC infrastructure
  • Designing and leading the development of large-scale control-plane systems, automation frameworks, and operational tooling
  • Defining reliability standards, SLO frameworks, and operational best practices used across multiple teams
  • Acting as a senior technical escalation point during critical incidents, guiding resolution and ensuring systemic fixes
  • Identifying structural reliability risks and driving cross-functional initiatives to address them at the architectural level
  • Partnering with Engineering, Network Operations, and Fleet Operations leadership to influence platform design and operational maturity
  • Mentoring senior and mid-level engineers, raising the overall quality and effectiveness of SRE practices
  • Driving measurable improvements in availability, MTTR, cost efficiency, and operational scalability

Skills

  • 10+ years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering roles operating complex, large-scale infrastructure
  • Expert-level software engineering skills, with a strong track record of building production-grade automation and systems
  • Deep expertise in Linux, networking, and distributed systems design at scale
  • Extensive experience debugging and resolving failures across hardware, OS, networking, and application layers
  • Proven ability to lead technical initiatives across teams without direct authority
  • Strong systems-thinking mindset, with the ability to balance reliability, velocity, and cost
  • Deep hands-on experience with AI or HPC platforms, including GPUs, high-speed interconnects (InfiniBand/RDMA), and workload schedulers (e.g. SLURM)
  • Experience designing observability systems for high-cardinality, high-throughput environments
  • Familiarity with Kubernetes at scale and hybrid or bare-metal cloud architectures
  • A history of driving step-change improvements in reliability, scalability, or operational efficiency

Benefits

  • Bonus
  • Equity
  • Commission programs
  • Medical
  • Dental
  • Vision
  • Flexible paid time off
  • Parental leave
  • Retirement plan participation
  • Human-First Flexibility: We treat Nscalers as humans first. 🫶🏽 Our flexible workplace trusts you to deliver, giving you the autonomy to shape your day around life's moments.
  • Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you wherever you work.

Company Overview

  • Nscale builds AI data centers and provides GPU cloud infrastructure that companies use to train, run, and scale large AI models. It was founded in 2024 and is headquartered in London, England, GBR, with a workforce of 201-500 employees. Its website is https://www.nscale.com.