Responsibilities
- Set up and maintain monitoring, metrics and reporting systems for fine-grained observability and actionable alerting
- Prioritize work effectively, transparently communicating and justifying time investments
- Assess and mitigate performance bottlenecks and system risks
- Ensure the performance, availability and reliability of all servers, services and microservices within the SaaS
- Influence architectural decisions with a focus on security, scalability and high performance
- Take part in a 24×7 on-call rotation for full-cycle incident response (mitigation, correction, prevention)
Qualifications
- 5+ years building and managing services for a distributed, high-availability web application
- Strong understanding of network layers and advanced traffic routing and caching flows
- Hands-on cloud computing experience (GCP/AWS preferred) with infrastructure and system architecture for large-scale applications
- Advanced skills in Linux, networking, storage and virtualization, including automation with tools like Kubernetes, Terraform and Ansible
- Ability to manage competing priorities and work well under pressure
- Self-driven, with an analytical mind and a bias for action
Skill tags: AWS, DNS, firewall, github, linux, load balancing, rdbms