Systems Reliability Engineer

BackOps AI

Other Engineering

San Francisco, CA, USA

Posted on Feb 10, 2026

Systems Reliability Engineer (SRE)

San Francisco • Hybrid • Full-time

About BackOps AI

BackOps AI is transforming supply chain operations with agentic AI solutions that automate complex workflows, freeing operations teams to focus on what matters most. Headquartered in the San Francisco Bay Area with flexible remote-friendly options, we foster a culture of innovation, ownership, and measurable impact.

Role Overview

As a Systems Reliability Engineer (SRE), you’ll own the reliability, scalability, and security posture of the platforms that power our agentic workflows. You’ll build the guardrails and operational foundations that let product and AI teams ship quickly without sacrificing uptime, observability, or customer trust. We run primarily on AWS; familiarity with GCP is a plus.

What You’ll Do

Reliability & Availability: Define and improve SLOs/SLIs, reduce error budget burn, and drive initiatives that improve uptime and customer experience
Incident Response: Lead and/or participate in on-call rotations; run incident response, coordinate remediation, and produce clear postmortems with measurable follow-ups
Observability: Build end-to-end observability (metrics, logs, tracing), dashboards, alerts, and runbooks that make issues diagnosable quickly across services and agents
Cloud Operations (AWS): Improve and maintain AWS foundations (IAM, VPC/networking, compute, storage, monitoring, logging, and security controls)
Infrastructure as Code: Build and maintain repeatable infrastructure using IaC; enforce consistency across environments (dev/stage/prod) and reduce configuration drift
Deployment & CI/CD: Improve deployment safety and velocity (progressive rollouts, rollback strategies, canary patterns, automation in CI/CD)
Security & Compliance: Implement and operationalize security best practices (least privilege IAM, secrets management, audit logging, network segmentation) and support SOC 2–aligned controls
Performance & Cost: Identify bottlenecks and reliability risks; tune compute/database/network performance and optimize cloud spend without compromising availability
Data Protection: Own backup/restore strategies, disaster recovery plans, retention/deletion execution, and periodic recovery testing

What We’re Looking For

Experience: 4+ years in SRE/DevOps/Infrastructure roles supporting production systems with meaningful uptime requirements
AWS Expertise: Strong hands-on experience operating workloads in AWS (IAM, VPC/networking, compute, storage, monitoring, and security controls)
Systems Thinking: Solid understanding of distributed systems failure modes (timeouts, retries, cascading failures), and how to design for resilience
Operational Excellence: Strong incident leadership instincts; comfortable being the calm, methodical driver during outages
Automation Mindset: You automate first—repeatable environments, scripted operations, and minimal manual toil
Clear Communicator: Can write crisp runbooks, postmortems, and technical proposals; able to align engineering, product, and ops on priorities
Security & Quality: Proven ability to improve security posture and reliability without blocking delivery

Nice to Have (Tools & Stack)

CloudWatch: Strong experience with CloudWatch Logs/Metrics/Alarms, dashboarding, and alert hygiene
Sentry: Experience operating error monitoring and triage workflows (alert tuning, release health, actionable grouping)
LangSmith: Familiarity with LLM/agent observability (trace analysis, evals/monitoring signals, debugging agent failures)
incident.io: Experience running incident workflows (paging, incident timelines, postmortems, follow-up tracking)
GCP: Experience operating production systems on GCP (or hybrid/multi-cloud environments)
Kubernetes: Experience with Kubernetes (or deep experience with managed platforms and production deployment patterns)
Compliance: Strong background in compliance-oriented environments (SOC 2), audit readiness, and control implementation

What We Offer

Equity & Ownership: Competitive equity so you grow alongside the company
Impact & Visibility: Direct access to co-founders; your work directly improves customer trust and operational outcomes
Collaborative Culture: Tight-knit team of seasoned operators and AI experts
Flexible Work: Hybrid with core Bay Area presence and remote flexibility

Req ID: R20
