Prompt
Director of Site Reliability and Cloud Infrastructure
USA Full Time
DevOps / Sysadmin
Source: Remotive

About the job

Job Overview: We are seeking a highly skilled and strategic Director of Site Reliability and Cloud Infrastructure to join our team. In this role, you will initially take on the responsibilities of an individual contributor, working hands-on to develop, maintain, and enhance our infrastructure while ensuring security, reliability, and scalability. As you establish a strong foundation, you will also be responsible for collaborating with our existing vendors and scaling the internal team by hiring additional resources focused on security, site reliability, and cloud infrastructure.

This position is ideal for a seasoned leader who thrives in both hands-on technical work and strategic leadership. You will play a critical part in shaping the future of our infrastructure and ensuring that our systems are both secure and highly available.

Key Responsibilities:

  • Hands-On Infrastructure Management:

    • Develop and maintain scalable and automated infrastructure solutions, particularly on AWS.

    • Implement and manage monitoring, alerting, and logging systems to detect and address reliability and security risks.

    • Manage incident response and resolution processes to minimize downtime, prevent recurrence, and ensure robust disaster recovery practices.

    • Conduct system performance tuning, capacity planning, and optimization to effectively manage resource utilization and loads.

  • Vendor Collaboration and Oversight:

    • Build and maintain strong relationships with cloud, security, and infrastructure vendors, ensuring their services meet performance, compliance, and security needs.

    • Lead contract negotiations and performance reviews for external vendors, ensuring alignment with internal standards and SLAs.

  • Team Building and Leadership:

    • Hire, mentor, and lead a high-performing team of site reliability engineers (SREs), security experts, and infrastructure engineers.

    • Develop career growth plans and technical progression frameworks for team members, ensuring skills development in cloud technologies and SRE best practices.

    • Create a cohesive vision for cloud infrastructure, reliability, and security, aligning with the broader organizational goals.

  • Security and Compliance Leadership:

    • Implement and maintain security best practices, including compliance with SOC2, HIPAA, and other relevant standards.

    • Ensure the infrastructure is protected against threats and vulnerabilities.

    • Drive innovation in cloud infrastructure and security, continuously improving our processes and systems.

  • Automation and Tooling:

    • Build and maintain automation tools and scripts to streamline system updates, deployments, and monitoring.

    • Design and oversee CI/CD pipelines, ensuring seamless integration with development and operations teams.

  • Collaboration and Stakeholder Management:

    • Work closely with the development, operations, and product teams to ensure alignment on priorities and collaboration on large-scale projects.

    • Provide technical guidance and mentorship across teams, championing a culture of reliability, automation, and security.

    • Communicate progress, risks, and issues clearly to both technical and non-technical stakeholders.

Qualifications:

  • Bachelor’s degree in Computer Science, Engineering, or a related field.

  • Proven experience in a senior leadership role managing cloud infrastructure and site reliability, preferably within an AWS environment (EC2, S3, RDS, ELB, etc.).

  • Hands-on experience with infrastructure as code (e.g., Terraform, CloudFormation) and automation tools (e.g., Ansible, Jenkins).

  • Strong scripting skills (Python, Bash) and the ability to automate complex tasks.

  • Demonstrated success in scaling infrastructure and teams, particularly within high-availability and high-growth environments.

  • Solid understanding of networking, cloud security, and compliance standards (e.g., SOC2, HIPAA).

  • Strong incident management skills and the ability to lead post-incident reviews to drive improvements.

  • Excellent communication skills and the ability to collaborate effectively with cross-functional teams.

  • Experience in hiring, developing, and managing technical teams with a focus on career development and innovation.

Preferred Qualifications:

  • Experience in a high-growth SaaS company, especially within the healthcare or regulated industries.

  • Familiarity with cloud cost optimization, scalability best practices, and disaster recovery strategies.

  • Demonstrated ability to lead through influence, setting technical direction and ensuring execution across teams.

  • Relevant certifications: AWS (Solutions Architect, DevOps Engineer, Security), CCSP, CISSP.


Perks - What you can expect:

  • Competitive salaries

  • Remote/hybrid environment

  • Potential equity compensation for outstanding performance

  • Flexible PTO

  • Company-wide sponsored lunches

  • Company-paid disability and life insurance benefits

  • Company-paid family and medical leave

  • Medical, dental, and vision insurance benefits

  • Discounted pet insurance

  • FSA/DCA and commuter benefits

  • 401(k)


Prompt Therapy Solutions, Inc is an equal opportunity employer and does not discriminate on the basis of race, color, religion, ethnicity, ancestry, national origin, sex, gender, gender identity, sexual orientation, age, marital status, veteran status, disability, medical condition, or any other protected characteristic. We celebrate diversity and are committed to creating an inclusive environment for all employees.

Prompt Therapy Solutions, Inc is an E-Verify Employer.
