[Remote] Lead Site Reliability Engineer

Remote · USA · Full-time

Note: This is a remote position open to candidates in the USA. Gradle Technologies is an AI-native company focused on transforming software development through its Develocity platform. The company is seeking a Lead Site Reliability Engineer to define the SRE vision, set operational standards, and ensure reliability across production services while mentoring a growing team.

Responsibilities

Operate and maintain all Develocity instances and supporting services in production
Define and evolve SRE standards, practices, and operating models, including on-call, incident response, postmortems, and SLOs
Participate in a follow-the-sun on-call rotation, acting as a technical escalation point for complex or high-severity incidents
Lead incident response and blameless retrospectives, ensuring learnings result in measurable reliability improvements
Set reliability priorities using risk, customer impact, business goals, SLOs, and error budgets
Identify systemic reliability risks and continuously evolve Develocity’s SaaS operations as the platform and customer base grow
Lead and influence architectural and design reviews to ensure reliability, scalability, and operability
Drive automation across deployment, upgrades, monitoring, self-healing, recovery, and operational workflows
Build and maintain comprehensive observability for all managed services, including logging, metrics, tracing, and alerting
Own disaster recovery, backups, and business continuity planning and execution
Partner with engineering leadership to balance feature delivery with reliability and operational excellence
Mentor and coach SREs, supporting technical growth and strong operational practices
Help onboard new SREs and contribute to hiring by defining and assessing SRE excellence at Develocity
Communicate clearly with customers during incidents and maintenance windows
Optimize performance, resource utilization, and operational costs

Skills

7+ years in SRE, DevOps, or an equivalent role operating production services at scale
Experience leading reliability initiatives across multiple teams or services
Demonstrated ability to influence technical direction without direct authority
Experience designing and operating systems with SLOs and error budgets, and exercising strong judgment in balancing reliability, velocity, and cost
Strong Kubernetes experience in production environments
Cloud infrastructure expertise, preferably AWS (EKS, RDS, S3, EC2)
Proficiency with observability tools (Prometheus, Grafana) and Infrastructure as Code (Terraform)
Track record of incident management and response in a 24/7 on-call environment
Scripting proficiency (Python, Bash) for automation
Strong written and verbal English communication skills
Experience as a founding or early SRE establishing practices in a growing SaaS organization
Familiarity with Develocity
JVM language experience (Java, Kotlin)
Experience with customer-facing and executive-level incident communications

Benefits

A ground-floor role in a new SRE team - you'll shape how we do things, not inherit someone else's decisions.
Real ownership of production systems used by engineers at companies you've heard of.
Direct interaction with customers when things go wrong (and when they go right).
A culture that values automation over heroics.
In-person meetings, such as our annual company offsite and team meetings.
Work from home in a remote-first environment.
Competitive salaries and equity grants.

Company Overview

Gradle Technologies is the award-winning developer productivity company behind Gradle Build Tool—one of the most used build systems in the world—and Develocity®, the leading developer observability platform. It was founded in 2014, and is headquartered in San Francisco, California, USA, with a workforce of 51-200 employees. Its website is https://gradle.com/.

Company H1B Sponsorship

Gradle Technologies has a track record of offering H1B sponsorships, with 1 in 2025, 1 in 2024, and 2 in 2022. Note that this does not guarantee sponsorship for this specific role.
