Role: DevOps Engineer
Location: Hong Kong
The candidate is responsible for daily operations of production and non-production IKP Kubernetes clusters, as well as the management of bundled services such as Istio and Prometheus. Members of the team are expected to work closely with L1 support, Services Engineering, and the IKP Core team. Team members should address incidents and resolve issues, while striving to improve observability and reduce manual toil.
Objectives of this Role
- Run the IKP clusters by monitoring availability and taking a holistic view of system health
- Build tools and automation to manage platform infrastructure and services
- Improve reliability and quality, and reduce the time needed to upgrade cluster and service versions
- Measure and optimize system performance and resource utilization, and plan for future capacity
- Build dashboards and visualizations to graph system health
- Define system alerts and automate responses where possible
- Provide operational support and engineering for multiple software development teams
Required Skills and Qualifications
- Experience with service mesh or overlay network technology, such as Istio, Linkerd, or Envoy
- Experience with distributed storage technologies such as NFS, HDFS, Ceph, and S3
- Experience maintaining and deploying highly available, fault-tolerant systems at scale
- Practical experience with Docker containerization and cluster orchestration, with strong Kubernetes/ECS experience, preferably GKE on-prem
- Experience with version control systems (e.g. Git with Bitbucket or GitHub)
- Experience implementing CI/CD as code (e.g. Jenkins with a Jenkinsfile)
- Experience with databases and backups (e.g. Postgres)
- Experience with CI/CD deployments using Helm and kOps
- Experience with configuration management tools (e.g. Ansible, Chef)
- Experience with infrastructure-as-code (e.g. Terraform, Packer, Puppet)
- A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
Daily and Monthly Responsibilities
- Identify, research, and analyze production problems, and develop solutions
- Modernize automated deployment and configuration of infrastructure
- Promote the use of automation to solve technical challenges, and strive to create elegant, reliable, performant, and cost-effective solutions as part of the project deliverables
- Ensure cluster and service versions are updated predictably and uniformly
- Respond to platform incidents escalated by L1 support
- Participate in system design consulting, platform management, and capacity planning
- Balance feature development speed and reliability with well-defined Service Level Objectives
- Participate in project work, including analysis, requirements definition, and testing
- Identify system enhancements and document business needs