[Remote] DevOps Engineer
Note: This is a remote position open to candidates in the USA. EPAM Systems is a major technology company specializing in infrastructure that supports AI research, and it is seeking a DevOps Engineer to help maintain production Kubernetes-based systems. The role focuses on site reliability engineering, observability, and SQL production support, ensuring system reliability and performance across an Azure Stack environment.
Responsibilities
- Design, maintain and progressively improve observability solutions, including dashboards and visual reports built with Grafana or comparable monitoring tools
- Set up, implement and oversee metrics, SLIs, SLOs and alerting approaches to guarantee reliability and transparency across production systems
- Deliver business-hours operational support for Kubernetes-based production environments, involving initial troubleshooting, log review and metric-based investigations
- Assist with SQL-based systems as part of production operations, contributing to issue examination and performance diagnostics
- Examine incidents and system behavior to pinpoint root causes, take part in post-incident reviews and suggest enhancements for monitoring and reliability practices
- Work hand in hand with engineering, platform and research teams to raise observability standards, refine operational processes and strengthen overall system stability
- Contribute to documentation, knowledge-sharing activities and ongoing improvement initiatives within the team
Skills
- At least 2 years of relevant hands-on professional experience
- Demonstrated track record in Site Reliability Engineering (SRE), DevOps, Production Support or equivalent roles working with production systems
- Practical exposure to observability and monitoring stacks including Grafana, Prometheus, Elastic Stack, Datadog or similar tools
- Strong command of Linux systems, supported by solid troubleshooting and log analysis capabilities
- Working experience supporting Kubernetes-based environments in production settings
- Background in delivering SQL production support, including query troubleshooting and basic performance diagnostics
- Confident scripting skills in Python, Bash or similar languages for automation and day-to-day operational activities
- Capability to investigate incidents, determine underlying causes and drive continuous improvement efforts
- Effective communication and teamwork skills for working successfully with distributed and cross-functional teams
- Proficient English communication skills, both spoken and written, at B2+ level or higher
- Experience working with APIs and integration patterns to connect services and enable system interoperability
- Knowledge of databases, covering administration, tuning and production-level support activities
- Exposure to Infrastructure as Code development and maintenance for automating environment provisioning and configuration
- Practical experience using Microsoft Azure to manage cloud resources and run production workloads
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn