Site Reliability Engineer (SRE/DevOps) - Engineering Productivity
Salary not specified
Senior · Full-time
#346927 · Added today
Source: Arista Networks
Tech Stack / Keywords
DevOps, Networking, Go, Python, Shell Scripting, Linux, Unix
Company and Position
Arista Networks is an industry leader in data-driven, client-to-cloud networking for large data center, campus and routing environments. They leverage advancements in cloud computing, artificial intelligence, and software-defined networking to provide competitive solutions. The company values diversity and inclusivity and has received awards for Best Engineering Team, Best Company for Diversity, Compensation, and Work-Life Balance.
Requirements
Essential Skills:
- At least BSc Computer Science or Engineering + 3 years’ experience, MS Computer Science or Engineering + 3 years’ experience, or equivalent work experience.
- Knowledge of one or more of Go, Python, or shell scripting, sufficient to implement medium-complexity automation workflows.
- Knowledge of Linux (or UNIX) from an administration and debugging perspective.
- Hands-on experience operating software systems (infrastructure, complex applications) at scale.
- Experience in server provisioning, especially from a storage and networking perspective.
- Strong problem-solving and software troubleshooting skills.
- Experience with infrastructure-as-code.
Desired Skills:
- Experience managing databases such as MariaDB, PostgreSQL, and MongoDB.
- Experience with Docker and virtualization technologies like KVM, QEMU, and Kata Containers.
- Experience managing a monitoring stack including Prometheus, Loki, Tempo, InfluxDB, Grafana, and Thanos.
- Experience managing Elasticsearch clusters.
- Experience managing Artifactory and Docker registries.
- Experience managing CI/CD systems like Argo CD and Spinnaker.
- Experience managing version control systems like Perforce and Gerrit.
- Experience with infrastructure-as-code frameworks like Ansible.
- Experience managing large Java applications.
- Experience in storage infrastructure management, such as NAS, SAN, and Ceph.
Responsibilities
- Build, safely and incrementally deploy, and operate critical production systems, focusing on scalability, reliability, observability, performance, and security.
- Monitor, support, and enhance the developer experience across services.
- Build automation to remove toil and efficiently operate production systems.
- Proactively monitor, respond to, and enhance alerts and set up automated alert handling.
- Create and maintain incident response runbooks.
- Triage platform and infrastructure issues and assist Arista software engineers with their triage; engage with third-party vendor support.
- Write postmortem documents and build solutions to prevent incident recurrence.
- Plan and communicate maintenance windows on production systems.
- Work with product development teams to identify infrastructure bottlenecks, then design and implement solutions.
- Survey and adopt infrastructure and platform best practices to maintain secure, scalable, and fault-tolerant systems.
- Study the design and implementation details of OSS systems to improve triage and speed up fixes.