IBM WebSphere Liberty Site Reliability Engineering (SRE) is a discipline that combines software and systems engineering to build and run large-scale, distributed, fault-tolerant systems. The SRE team is responsible for the availability and reliability of the IBM WebSphere Liberty Service and ensures that the service meets the requirements of both internal and external users.
We look for engineers who are motivated to collaborate with our Development squads to build and run sustainable production systems, and who can evolve and adapt to changes in our fast-paced, worldwide environment. You should have a strong desire to work within a CI/CD environment, a passion for embracing new cloud technologies, and an interest in working with our customers to ensure they are successful. You need to be collaborative, able to handle responsibility, and eager to learn new techniques and tools.
There is no requirement to be an expert in any one language or technology. However, knowledge of Go, Bash, Python, ArgoCD, Jenkins, Docker, Kubernetes, OpenShift, or IBM Cloud/AWS/Microsoft Azure would be useful. Experience operating highly available, zero-downtime production environments would also be beneficial. The key requirement is a passion for supporting, operating, and developing a high-quality, highly available service.
Broadly, responsibilities include:
• Scale and manage the service in cloud environments
• Create sustainable systems and services through automation
• Ensure a healthy production environment by monitoring availability and applying fault diagnosis and remediation as appropriate
• Drive the incident management process and support a blameless post-mortem culture
• Partner with development teams to improve services via rigorous testing and release procedures
• Participate in system design consulting, platform management, and capacity planning
• Ensure systems are secure and compliant