Site Reliability Engineer

Location: Ho Chi Minh City
Job Type: Permanent
Salary: Negotiable
Contact: Chelsea Phan
Reference: BBBH10721_1699421401
Posted: 8 months ago

About NextWave

NextWave Partners is the Recruitment Partner of choice within the Clean Energy, Sustainable Infrastructure, ESG, Impact Investment, Climate-Tech & Technology sectors. We are committed to supporting industries battling climate change towards a net-zero future and a sustainable economy.

About the role

Our client, a high-growth start-up, is looking for a Site Reliability Engineer. In this role, you will develop, maintain, and support a complex modern web application while learning and applying DevOps practices. The role offers hands-on experience with current tools and technologies such as TypeScript, Docker, AWS, and Terraform, and the chance to refine your expertise in system architecture, resilience, and incident management. You will report to the DevSecOps Lead and be based in the client's newly established office in HCMC, working closely with the team in Singapore.

Roles and responsibilities

  • Develop a deep understanding of infrastructure and application systems, including their internal and external dependencies.
  • Implement and uphold a monitoring and alerting system to detect performance bottlenecks, system failures, and other issues.
  • Create and assess disaster recovery plans to ensure the uninterrupted operation of the business.
  • Understand and manage production code through Infrastructure as Code (IaC).
  • Manage and provide support for multiple applications and infrastructure elements.
  • React to production incidents, and coordinate collaborative efforts across various teams and third-party partners to resolve issues.
  • Lead root cause analyses (RCAs) and post-mortem assessments after incident resolution.
  • Independently acquire proficiency in new tools and technologies as dictated by project requirements.
  • Document solutions and tooling, disseminate knowledge, and provide training as needed.
  • Write operational guides (runbooks) and scripts for automated issue resolution.

Requirements

  • A Bachelor's Degree or equivalent practical experience.
  • More than 5 years of IT experience with a focus on Enterprise Cloud infrastructure (GCP, AWS, or Azure), with at least 4 years in AWS, particularly in a mission-critical environment.
  • Extensive expertise in cloud architecture with a focus on resilience and security.
  • Proficiency in identifying potential system bottlenecks and proposing enhancements.
  • Sound comprehension of microservices, event-driven architectures, and the adoption of DevSecOps practices for supporting complex distributed systems.
  • Background in maintaining and supporting cloud-related infrastructures, with a preference for experience in managing ECS and AWS Services.
  • Hands-on experience in conducting resilience, chaos, and stress testing within a cloud context.
  • Comprehensive understanding of various concepts, technologies, and frameworks.
  • Solid knowledge of Unix/Linux operating systems, as well as proficiency in Python, Bash/shell scripting, and SQL.
  • Familiarity with engineering practices, including test automation, CI/CD, and release automation.
  • Experience in incident response planning and automation.
  • A self-reliant problem solver with a strong work ethic, capable of adhering to established architectural constraints.
  • Willingness to participate in 24/7 on-call rotations (overtime pay or time off in lieu provided).
  • Proficiency in both written and spoken English.
  • Profound knowledge of operating systems (RHEL, Ubuntu, Windows Server) with excellent debugging, troubleshooting, and problem-solving skills.
  • Expertise in at least one of the following programming languages: Python, Shell, Golang, or JavaScript, with an emphasis on Site Reliability Engineering and the support of cloud services.
  • Practical experience with cloud-based technologies and tools, particularly in deployment, monitoring, and operations, such as New Relic, Zabbix, CloudWatch, Snyk+Fugue, Grafana, and Prometheus.
  • Strong familiarity with modern development technologies and tools, including Agile, CI/CD, Git, Terraform, and CircleCI.
  • A solid understanding of networking protocols and cybersecurity best practices in a cloud environment.
  • AWS certification is highly desirable.

If you are interested in this position, please apply directly on the platform with your latest CV. We will review your application and respond promptly.

Keep in touch

If you wish to keep up to date with the latest NextWave opportunities and industry updates, please follow us on LinkedIn and create your profile on our website to receive a weekly newsletter in your inbox!

Our commitment

Diversity is a core value at NextWave Partners, and we are proud to be partnering with equal opportunities employers. All qualified applicants will receive consideration for employment without regard to race, colour, religion, gender, gender identity or expression, sexual orientation, national origin, disability or age.

EA Registration No: R2199999
NextWave Partners Ltd. (EA License No: 16S8303 - UEN: 201602833E)