About the job
What we do
At Perlego, there are over 100 of us working hard to make education accessible to all. In this digital age, we believe that anyone should be able to learn anything at any time. Knowledge should be more accessible, not locked behind sky-high price tags.
Over the past 5 years, our goal has been to help students across the UK & Europe access quality books. The next stage of Perlego is twofold: 1) expand our support to students globally, and 2) build a product that goes beyond the book: a platform that helps students study smarter and more effectively.
What we're looking for:
We are looking for an experienced Site Reliability Engineer (SRE) with a strong background in AWS services and monitoring tools. In this role, you will ensure the availability and reliability of our services, especially outside office hours, as most of the team is based in Europe and India. You will be integral to addressing issues swiftly and resolving incidents independently, and you will thrive in a fast-paced environment.
How we collaborate:
Our organization operates across multiple time zones, with teams based across Europe. As an SRE, you will provide critical support during off-hours, working autonomously to resolve issues while collaborating closely with our teams to ensure continuous service availability. You will be part of a global team, supporting cloud infrastructure and platform initiatives.
What you’ll do:
As a Site Reliability Engineer, your main focus will be to ensure our services remain highly available and performant. Key responsibilities include:
Monitoring & Incident Management:
- Monitor and manage platform activity using tools like Datadog, Prometheus, Grafana, or AWS CloudWatch.
- Respond quickly to alerts and incidents, independently resolving issues and ensuring service uptime during off-hours.
- Conduct post-incident reviews and help improve system resiliency through automation and monitoring enhancements.
Cloud Infrastructure Management:
- Manage and support AWS infrastructure, focusing on scalability, security, and reliability.
- Handle deployments, managing CI/CD pipelines for both containerized (Docker/Kubernetes) and serverless (AWS Lambda) applications.
- Ensure effective backup, recovery, and disaster recovery strategies to minimize downtime.
Collaboration & Communication:
- Collaborate with cross-functional teams to implement platform improvements.
- Work independently and make swift decisions when managing service incidents outside core business hours.
- Assist in platform security, ensuring adherence to best practices for cloud security and compliance.
Continuous Improvement:
- Automate manual processes to reduce human error and improve efficiency.
- Continuously enhance monitoring systems, ensuring robust early detection and resolution capabilities.
- Identify potential performance bottlenecks and contribute to overall platform optimization.
Requirements
This role is ideal for you if you possess:
- Experience in Site Reliability Engineering, DevOps, or a similar field.
- Strong experience with AWS services.
- Expertise in using monitoring tools (e.g. Prometheus, Grafana, CloudWatch) for real-time platform performance insights.
- Hands-on experience with CI/CD pipeline management for deploying containerized (Docker) and serverless applications.
- Proficiency in Linux-based operating systems and shell scripting.
- Familiarity with Infrastructure as Code tools (Terraform, CloudFormation).
- Experience with incident management, troubleshooting, and platform recovery in high-pressure environments.
- Strong communication skills with a proven ability to work both independently and collaboratively across time zones.
⭐️ It’s a plus if you have:
- Experience working in a global, distributed team providing off-hours support.
- Knowledge of container orchestration tools.
- Previous experience with SecOps and cloud security best practices.
- Familiarity with scaling highly available systems in a fast-paced, growth-oriented environment.
Benefits