Site Reliability Engineer
Elastic is the world's leading software provider for making structured and unstructured data usable in real time for search, logging, security, and analytics use cases. Elastic was founded in 2012 by the people behind the Elasticsearch, Kibana, Logstash, and Beats open source projects. Its global community has more than 50,000 members across 45 countries, and since their initial release, Elastic's products have achieved more than 50 million cumulative downloads. Today thousands of organizations, including Cisco, eBay, Dell, Goldman Sachs, Groupon, HP, Microsoft, Netflix, NY Times, Uber, Verizon, Yelp, and Wikipedia, use the Elastic Stack, X-Pack, and Elastic Cloud to power mission-critical systems that drive new revenue opportunities and massive cost savings. Elastic is backed by $104 million in funding from Benchmark Capital, Index Ventures, and NEA; has headquarters in Amsterdam and Mountain View, California; and has over 300 employees in over 30 countries around the world.
Thanks to our ongoing expansion, we have the opportunity to grow our Cloud Site Reliability team. We are part of the Elastic Cloud team: people with an operations background who aren't afraid to get our hands dirty. We are the first line of consumers for Elastic's products, and our experience helps influence the direction of the product. While most organizations have one or a handful of Elastic Stack deployments, here you'll be responsible for identifying, troubleshooting, and reporting platform problems to developers to ensure that the thousands of Elasticsearch clusters we manage provide a stable and reliable service. We're looking for people who are just as excited about troubleshooting issues with distributed systems as they are about automating, coding, and collaborating to solve problems.
- Report and troubleshoot problems within the Elastic Cloud infrastructure services and collaborate on issues with developers
- Handle day-to-day operations around the Elastic Cloud, such as customer trouble tickets, managing cloud provider infrastructure (maintenance/expansion), and software deployments
- Develop and enhance tooling to deploy and manage the Elastic Cloud product and infrastructure
- Demonstrate and promote best practices for teams using cloud platforms
- Multiple years of hands-on experience administering Linux, preferably on distributed systems at some scale
- 1+ years of experience with AWS, GCP, and/or Azure is a must
- Experience automating production Linux systems collaboratively, deriving configuration through version control
- Comfortable writing software to automate API-driven tasks at scale. The SRE team uses Python and some Go, while the developers use Scala, Python, and Java
- Have used Ansible/Puppet/Chef or another config management suite, know where it's broken, and are open to trying new things
- Healthy knowledge of Linux (have compiled your own kernel at some point, know how to trace syscalls, understand TCP, care about the difference between sysvinit/runit/systemd, etc.)
- Relentless desire to automate and build software tools
- Desire to represent work in git, driven by a GitHub workflow through issues and pull requests
- Love open source development, and have contributed to some project somewhere (doesn't have to be ours), whether it's mailing lists, patches, documentation, etc.
- Enjoy working remotely and the communication it requires
- Love a diverse environment, working with men and women all over the world
Elastic is an Equal Opportunity employer committed to the principles of equal employment opportunity and affirmative action for all applicants and employees. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender perception or identity, national origin, age, marital status, protected veteran status, disability status, or any other basis protected by federal, state, or local law, ordinance, or regulation. Elastic also makes reasonable accommodations for disabled employees consistent with applicable law.