Senior DevOps SRE

As a Senior DevOps Site Reliability Engineer, you:

  • Have an automation-first mindset

  • Are passionate about performance, stability, and security

  • Believe in a proactive approach of prevention over mitigation, and mitigation over fixing

  • Are comfortable with change

  • Have had a positive experience working for a startup before

  • Are a U.S. citizen with an active clearance, or are willing and able to undergo the clearance process, including a polygraph

Required Skills - advanced knowledge of:

  • AWS — 3+ years of hands-on experience (Architect / DevOps / SysOps AWS Certification preferred)

  • Infrastructure as Code (Terraform)

  • Ansible automation

  • Kubernetes — 2+ years of in-depth experience deploying production applications and orchestrating containers

  • Kubernetes scheduling, networking, security, and load balancing

  • CI/CD (GitLab, Jenkins, or Bamboo)

  • Python, Perl, or Golang

  • IT operations best practices for an always-up, always-available, mission-critical service

Desired Experience:

  • Implementing observability and monitoring in AWS, using Splunk / ELK / similar

  • EKS, ECS, ECR

  • Working in an agile environment, focused on rapid cycles and CD

  • Supporting, analyzing, and troubleshooting large-scale distributed mission-critical systems

  • Building software and/or platforms where security, regulatory compliance, and high availability are critical

  • Strong understanding of Information Security in various environments

Responsibilities:

  • Implement and support FedRAMP and other applicable USG standards, policies, and regulations

  • Set up, integrate, and maintain a scalable, stable set of CI/CD tools to support development, testing, and security scanning

  • Be accountable for a large-scale SaaS app with a mission-critical customer base

  • Manage multiple tools, infrastructure, and roles in a fast-paced environment

  • Own the availability of our SaaS infrastructure and application

  • Implement best-in-class AWS solutions using infrastructure as code

  • Collaborate with engineering and product to continuously improve service availability and quality

  • Be involved in the entire production lifecycle: code deployments, infrastructure management, and troubleshooting

  • Share ownership with the Dev team, and own service availability and proactive issue prevention, using structured troubleshooting to mitigate issues

  • Work closely with our Dev and DevOps teams to ensure that our production services are secure, scalable, performant, and resilient