As Director, Reliability Engineering (RE), you will inspire a diverse team of reliability engineers with your leadership and passion for increasing system reliability. You will coordinate RE engagement across the organization, focused on anticipating, identifying, and resolving some of the most complex production issues impacting the business. You will provide a single interface to engineering leadership and partner with Product and UX leadership to ensure overall product success.
The ideal candidate is interested in building scalable infrastructure, adding system resiliency, improving developer productivity, and automating everything that can (and should) be automated, and is a thoughtful people manager and leader. You will oversee a team of reliability engineers responsible for overall system health, availability, performance, and reducing operational issues, as well as the long-term strategy for our infrastructure. You will report to the VP, Technology Services and be based in our Pittsburgh office.
Evaluate RE requirements and build out the RE team to scale:
Drive the reliability of business-critical services in a complex distributed ecosystem
Develop a set of support practices for all Vertical (business facing) Product Teams, as well as Foundational (shared services including Platform, Infrastructure, Security, D&A) technology domains
Partner with domain teams to steer product roadmaps and ensure reliability is built in
Serve as an extension of these domains by discovering ways to improve support operations
Creating scalable engineering solutions will be at the heart of what you do.
As the primary leader of RE, you will study and understand RE industry best practices and help elevate the company’s status within the broader RE community.
Oversee day-to-day Reliability Engineering activities across all Brands and Channels:
Implement best practices to improve scalability of our systems across Store, Omni, eComm, Marketing, Supply Chain, and Corporate Tech as well as horizontal Foundational domains
Establish consistent reliability processes for all Digital and traditional Channels as we support more Brands, Vendors, Products, features, and technology platforms
Build an ecosystem of Observability to aid in the detection, triage, diagnosis, and ultimate resolution of business- and technology-impacting events
Establish and monitor KPIs for reliability, throughput, quality, and controls; deliver dashboards that provide operational and executive views
Perform 24×7 Level 2 support functions for all critical applications, systems, and products
Own system uptime, monitoring/alerting, CI/CD, cloud networking, security, and overall performance
Be a hands-on contributor to projects, including some coding, code reviews, and architectural discussions
Partner with Software Engineering to maximize product and platform reliability through code, tools, and monitoring improvements
Lead the transformation of system reliability, resiliency, and performance for all products and services to the next generation.
Implement Self-Healing solutions to address failures and faults and reduce business impact
Lead the Test Engineering team. Leverage test automation, end-to-end testing, and exploratory testing to detect issues and flaws before they result in business disruption
By thoughtfully setting strategies for reducing toil, you will improve the athlete and teammate experiences and enable our engineering & support organizations to run highly reliable services.
Staff Management and Financial Planning:
Perform staff oversight and financial management for all aspects of functions described in this job description.
Create, implement, and enhance an organization that best supports these responsibilities and delivers world-class operations and support functions to this Fortune 500 company.
Control and manage a budget that leverages technology and automation to deliver seamless and reliable technology execution.
As a technology leader, participate in overall technology strategy, goal setting, and future vision activities.
Education & Experience:
Bachelor’s degree in Computer Science or a related technical field, or equivalent practical experience.
10 years of experience with system design, algorithms, data structures, analysis, and software design.
10 years of experience managing a distributed team of engineers
Experience growing and building teams
Experience managing technology infrastructure and conducting technical deep dives into code
5+ years of site reliability engineering, DevOps, or related infrastructure experience
3+ years of engineering management experience
2+ years of retail and/or e-commerce experience
Experience with modern architectures and cloud native design
Experience with cloud infrastructure (Azure, GCP, etc.)
Experience with data streaming platforms like Apache Kafka, and other utility services
Experience with PCF, or similar PaaS providers
Proficiency in data collection and display toolsets (e.g., ELK, Prometheus)
Familiarity with and exposure to Extreme Programming techniques
Prior experience with test engineering and automation tools
Internal Number: 6291
About Synergy Staffing Inc.
Synergy Staffing Inc. delivers exceptionally qualified talent to businesses throughout Western Pennsylvania. Our understanding of our clients’ needs and the thoroughness of our matching process are unparalleled in the local market. Stop spending time searching for qualified talent by making Synergy your go-to resource to meet all your personnel needs.