Systems Reliability Engineer
BackOps AI
Other Engineering
San Francisco, CA, USA
Posted on Feb 10, 2026
Systems Reliability Engineer (SRE)
San Francisco • Hybrid • Full-time
About BackOps AI
BackOps AI is transforming supply chain operations with agentic AI solutions that automate complex workflows, freeing operations teams to focus on what matters most. Headquartered in the San Francisco Bay Area with flexible remote-friendly options, we foster a culture of innovation, ownership, and measurable impact.
Role Overview
As a Systems Reliability Engineer (SRE), you’ll own the reliability, scalability, and security posture of the platforms that power our agentic workflows. You’ll build the guardrails and operational foundations that let product and AI teams ship quickly without sacrificing uptime, observability, or customer trust. We run primarily on AWS; familiarity with GCP is a plus.
What You’ll Do
- Reliability & Availability: Define and improve SLOs/SLIs, reduce error budget burn, and drive initiatives that improve uptime and customer experience
- Incident Response: Lead and/or participate in on-call rotations; run incident response, coordinate remediation, and produce clear postmortems with measurable follow-ups
- Observability: Build end-to-end observability (metrics, logs, tracing), dashboards, alerts, and runbooks that make issues diagnosable quickly across services and agents
- Cloud Operations (AWS): Improve and maintain AWS foundations (IAM, VPC/networking, compute, storage, monitoring, logging, and security controls)
- Infrastructure as Code: Build and maintain repeatable infrastructure using IaC; enforce consistency across environments (dev/stage/prod) and reduce configuration drift
- Deployment & CI/CD: Improve deployment safety and velocity (progressive rollouts, rollback strategies, canary patterns, automation in CI/CD)
- Security & Compliance: Implement and operationalize security best practices (least privilege IAM, secrets management, audit logging, network segmentation) and support SOC 2–aligned controls
- Performance & Cost: Identify bottlenecks and reliability risks; tune compute/database/network performance and optimize cloud spend without compromising availability
- Data Protection: Own backup/restore strategies, disaster recovery plans, retention/deletion execution, and periodic recovery testing
What We’re Looking For
- Experience: 4+ years in SRE/DevOps/Infrastructure roles supporting production systems with meaningful uptime requirements
- AWS Expertise: Strong hands-on experience operating workloads in AWS (IAM, VPC/networking, compute, storage, monitoring, and security controls)
- Systems Thinking: Solid understanding of distributed systems failure modes (timeouts, retries, cascading failures), and how to design for resilience
- Operational Excellence: Strong incident leadership instincts; comfortable being the calm, methodical driver during outages
- Automation Mindset: You automate first, favoring repeatable environments, scripted operations, and minimal manual toil
- Clear Communicator: Can write crisp runbooks, postmortems, and technical proposals; able to align engineering, product, and ops on priorities
- Security & Quality: Proven ability to improve security posture and reliability without blocking delivery
Nice to Have (Tools & Stack)
- CloudWatch: Strong experience with CloudWatch Logs/Metrics/Alarms, dashboarding, and alert hygiene
- Sentry: Experience operating error monitoring and triage workflows (alert tuning, release health, actionable grouping)
- LangSmith: Familiarity with LLM/agent observability (trace analysis, evals/monitoring signals, debugging agent failures)
- incident.io: Experience running incident workflows (paging, incident timelines, postmortems, follow-up tracking)
- GCP: Experience operating production systems on GCP (or hybrid/multi-cloud environments)
- Kubernetes: Experience with Kubernetes (or deep experience with managed platforms and production deployment patterns)
- Compliance: Strong background in compliance-oriented environments (SOC 2), audit readiness, and control implementation
What We Offer
- Equity & Ownership: Competitive equity so you grow alongside the company
- Impact & Visibility: Direct access to co-founders; your work directly improves customer trust and operational outcomes
- Collaborative Culture: Tight-knit team of seasoned operators and AI experts
- Flexible Work: Hybrid with core Bay Area presence and remote flexibility
Req ID: R20