Senior Software Engineer, Site Reliability Engineering, Vertex AI 3P SRE
Company and position
Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google Cloud's services are reliable, have uptime appropriate to customers' needs, and improve at a fast rate. Vertex AI is Google Cloud's flagship AI platform, offering tools and services for the entire ML lifecycle, including support for third-party models. This role focuses on ensuring the reliability and performance of the infrastructure underpinning third-party model deployments, especially the GKE-based serving stack.
Requirements
- Bachelor’s degree in Computer Science, a related field, or equivalent practical experience.
- 5 years of experience with software development in one or more programming languages.
- 3 years of experience in designing, analyzing, and troubleshooting large-scale distributed systems.
- 2 years of experience leading projects and providing technical leadership.
Preferred qualifications:
- Master's degree in Computer Science or Engineering or a related technical field.
- Experience with load balancing, high availability, and scalable system design.
- Experience with monitoring and observability in cloud and cluster management systems (e.g., Cloud Monitoring, Monarch, Automon).
- Experience with performance analysis and optimization (e.g., Dapper, pprof).
- Experience with cluster management systems and Google Cloud interactions.
Responsibilities
- Engage in and improve the whole lifecycle of services from inception and design through to deployment, operation, and refinement.
- Support services before they go live through system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
- Maintain services once live by measuring and monitoring availability, latency, and overall system health.
- Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
- Practice sustainable incident response and blameless postmortems.
Other information
Google is an equal opportunity workplace and affirmative action employer committed to equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, or Veteran status. Qualified applicants are considered regardless of criminal histories, consistent with legal requirements. Accommodations for applicants with disabilities are available upon request.