[Remote] Lead Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. Gradle Technologies is an AI-native company focused on transforming software development through its Develocity platform. The company is seeking a Lead Site Reliability Engineer to define the SRE vision, set operational standards, and ensure reliability across production services while mentoring a growing team.
Responsibilities
- Operate and maintain all Develocity instances and supporting services in production
- Define and evolve SRE standards, practices, and operating models, including on-call, incident response, postmortems, and SLOs
- Participate in a follow-the-sun on-call rotation, acting as a technical escalation point for complex or high-severity incidents
- Lead incident response and blameless retrospectives, ensuring learnings result in measurable reliability improvements
- Set reliability priorities using risk, customer impact, business goals, SLOs, and error budgets
- Identify systemic reliability risks and continuously evolve Develocity’s SaaS operations as the platform and customer base grow
- Lead and influence architectural and design reviews to ensure reliability, scalability, and operability
- Drive automation across deployment, upgrades, monitoring, self-healing, recovery, and operational workflows
- Build and maintain comprehensive observability for all managed services, including logging, metrics, tracing, and alerting
- Own disaster recovery, backups, and business continuity planning and execution
- Partner with engineering leadership to balance feature delivery with reliability and operational excellence
- Mentor and coach SREs, supporting technical growth and strong operational practices
- Help onboard new SREs and contribute to hiring by defining and assessing SRE excellence at Develocity
- Communicate clearly with customers during incidents and maintenance windows
- Optimize performance, resource utilization, and operational costs
Skills
- 7+ years in SRE, DevOps, or an equivalent role operating production services at scale
- Experience leading reliability initiatives across multiple teams or services
- Demonstrated ability to influence technical direction without direct authority
- Experience designing and operating systems with SLOs and error budgets, and exercising strong judgment in balancing reliability, velocity, and cost
- Strong Kubernetes experience in production environments
- Cloud infrastructure expertise, preferably AWS (EKS, RDS, S3, EC2)
- Proficiency with observability tools (Prometheus, Grafana) and Infrastructure as Code (Terraform)
- Track record of incident management and response in a 24/7 on-call environment
- Scripting proficiency (Python, Bash) for automation
- Strong written and verbal English communication skills
- Experience as a founding or early SRE establishing practices in a growing SaaS organization
- Familiarity with Develocity
- JVM language experience (Java, Kotlin)
- Experience with customer-facing and executive-level incident communications
Benefits
- A ground-floor role in a new SRE team - you'll shape how we do things, not inherit someone else's decisions.
- Real ownership of production systems used by engineers at companies you've heard of.
- Direct interaction with customers when things go wrong (and when they go right).
- A culture that values automation over heroics.
- In-person meetings, such as our annual company offsite and team meetings.
- Work from home in a remote-first environment.
- Competitive salaries and equity grants.