DevOps/SRE Engineer
We are looking for a DevOps/SRE Engineer to support the reliable delivery and day-to-day operation of AI-driven capabilities within customer-facing products in Minneapolis, Minnesota. This position is focused on production readiness, service stability, and smooth integration across platforms rather than building the underlying AI features. The ideal candidate will help ensure new functionality is introduced safely, monitored effectively, and maintained with a strong emphasis on performance, cost awareness, and customer impact.<br><br>Responsibilities:<br>• Lead the release of AI-enabled product capabilities from pre-production validation through live deployment, ensuring launches are controlled and dependable.<br>• Oversee production health by tracking availability, response times, failures, and service quality, and take prompt action when issues affect performance.<br>• Maintain and improve connections between external AI providers, internal model services, and customer-facing applications to support reliable functionality.<br>• Administer API credentials, usage thresholds, vendor quotas, and spend controls, proactively identifying risks related to capacity or budget.<br>• Create and refine operational dashboards, alerting rules, and response documentation to strengthen support for AI-related incidents.<br>• Work closely with product and engineering partners to plan staged rollouts, feature gating, rollback paths, and low-risk release strategies.<br>• Support customer-facing teams by explaining AI feature readiness, expected delivery timelines, and practical capabilities in clear business terms.<br>• Participate in customer or sales discussions when technical expertise is needed, helping address questions about solution behavior, roadmap direction, and use case alignment.<br>• Manage investigation and resolution of customer-impacting incidents by coordinating with internal stakeholders and external vendors while providing timely updates.<br>• Monitor usage patterns, operating costs, vendor changes, model retirements, and security notices, and prepare tested mitigation or migration plans before service is affected.
Requirements:<br>• At least 3 years of experience in DevOps, site reliability, software engineering, or production operations supporting live customer environments.<br>• Strong programming ability in Python with practical experience working with APIs, webhooks, and asynchronous service interactions.<br>• Proven background operating systems in production with an understanding of reliability, scalability, and incident handling under real-world load.<br>• Hands-on experience with monitoring and observability platforms such as Datadog, Grafana, New Relic, Amazon CloudWatch, or comparable tools.<br>• Familiarity with at least one major AI platform, including OpenAI, Claude, Azure OpenAI, Amazon Bedrock, or Google Vertex AI, along with production concerns such as latency, fallback design, rate limits, and cost control.<br>• Working knowledge of cloud infrastructure and CI/CD practices used to deploy, update, and maintain services consistently.<br>• Ability to write clear operational documentation, including runbooks and post-incident summaries, and to lead communication during service disruptions.<br>• Strong communication skills with the confidence to explain technical topics to non-technical stakeholders and customers while maintaining sound security and data-handling practices.
<h3 class="rh-display-3--rich-text">Technology Doesn't Change the World, People Do.<sup>®</sup></h3>
<p>Robert Half is the world’s first and largest specialized talent solutions firm that connects highly qualified job seekers to opportunities at great companies. We offer contract, temporary and permanent placement solutions for finance and accounting, technology, marketing and creative, legal, and administrative and customer support roles.</p>
<p>Robert Half works to put you in the best position to succeed. We provide access to top jobs, competitive compensation and benefits, and free online training. Stay on top of every opportunity - whenever you choose - even on the go. <a href="https://www.roberthalf.com/us/en/mobile-app" target="_blank">Download the Robert Half app</a> and get 1-tap apply, notifications of AI-matched jobs, and much more.</p>
<p>All applicants applying for U.S. job openings must be legally authorized to work in the United States. Benefits are available to contract/temporary professionals, including medical, vision, dental, and life and disability insurance. Hired contract/temporary professionals are also eligible to enroll in our company 401(k) plan. Visit <a href="https://roberthalf.gobenefits.net/" target="_blank">roberthalf.gobenefits.net</a> for more information.</p>
<p>© 2025 Robert Half. An Equal Opportunity Employer. M/F/Disability/Veterans. By clicking “Apply Now,” you’re agreeing to Robert Half’s <a href="https://www.roberthalf.com/us/en/terms">Terms of Use</a> and <a href="https://www.roberthalf.com/us/en/privacy">Privacy Notice</a>.</p>
- Minneapolis, MN
- onsite
- Permanent / Full Time
- 100000 - 125000 USD / Yearly
- 2026-05-22T00:00:00Z