Staff SRE (Site Reliability Engineer)
Gigster
- Design, build, and maintain scalable and reliable infrastructure.
- Collaborate with engineering teams to ensure systems are designed with reliability and scalability in mind.
- Evaluate and integrate new technologies to enhance our infrastructure.
- Implement and maintain monitoring and alerting systems to detect and respond to issues promptly.
- Lead incident response efforts, ensuring quick resolution and effective communication.
- Conduct post-incident reviews and drive improvements based on findings.
- Architect and build automation projects from scratch (preferably in Python or Go) to reduce day-to-day SRE toil.
- Create Bash scripts to automate mundane manual activities such as upgrades, status checks, and deployments.
- Develop and maintain infrastructure as code (IaC) using tools such as Terraform, Ansible, or similar.
- Automate repetitive tasks and processes to improve efficiency and reduce manual intervention.
- Collaborate with cross-functional teams to deliver high-quality products and services.
- Mentor and guide junior SREs and other team members.
- Advocate for best practices in reliability engineering across the organization.
- Drive initiatives to improve service reliability, capacity, and performance.
- Participate in capacity planning and disaster recovery exercises.
- Stay current with industry trends and emerging technologies.
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
- 8+ years of industry experience as a Software Engineer, SRE, or Platform Engineer.
- 3+ years of experience as a Platform Engineer or SRE.
- Proven experience in managing large-scale, mission-critical infrastructure.
- Deep understanding of Linux/Unix systems and networking.
- Proficiency in at least one programming language (e.g., Python, Go, Java).
- Intermediate to expert skill in Bash scripting.
- Experience with cloud platforms (AWS, Azure, GCP), containers (Docker), and container orchestration (Kubernetes).
- Strong knowledge of monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
- Familiarity with CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
- Excellent problem-solving skills and a proactive attitude.
- Strong communication and collaboration skills.
- Ability to work independently and as part of a team.
- Demonstrated leadership and mentoring abilities.
- English Proficiency Assessment (25 mins)
- Technical Assessment (45 mins)
- Recruiter screen (30 mins)
- Technical Interview (30-45 mins)
- World-class network. Be part of a network with the most talented people in the world.
- Amazing cutting-edge projects. Pick the projects from Fortune 500 companies that interest you.
- 100% remote and global. Live your best life, wherever that may be, and never lose out on career opportunities because of it.
- Flexible work hours. Some overlap with the customer’s time zone is expected, but most of the time we work asynchronously and don’t care when you’re online, as long as you deliver great results.
- Flexible offerings. Choose how many hours you want to work and how much you want to earn.
- Swag! Because who doesn’t love swag?
Source: remotive.com
To apply, please visit the following URL: https://remotive.com/remote-jobs/devops/staff-sre-site-reliability-engineer-1935492