Site Reliability Engineer
Luminance
Sydney, New South Wales, Australia
•2 hours ago
About
- The Role
- Luminance’s Site Reliability team combines strong problem solving, infrastructure tooling and wider DevOps practices to deliver Luminance’s unique software applications as a service. The team plays a crucial role in incident response and issue resolution, swiftly addressing and resolving service interruptions to maintain the highest level of customer satisfaction. With a focus on automation, scalability, reliability and security, the team enables Luminance to provide a performant, seamless experience for its users. The Site Reliability team is a small, dynamic team of creative engineers who work together to tackle some of Luminance’s greatest challenges, with new problems and technology areas to dig into on a regular basis.
- Roles and Responsibilities
- System Monitoring: Implement, manage, and develop internal monitoring tools to ensure system health and quickly detect anomalies. Respond to and resolve incidents efficiently to maintain uptime.
- Automation: Develop automation solutions for infrastructure management, issue resolution and deployment processes, streamlining operations and reducing manual work.
- Infrastructure Management: Manage cloud infrastructure to ensure reliability and scalability, collaborating with teams to design robust solutions.
- Incident Management: Conduct post-incident analysis to identify root causes, implement preventive measures, and enhance system resilience.
- Security and Compliance: Maintain security best practices and compliance standards, working with security teams to address vulnerabilities proactively.
- Collaboration and Communication: Partner with development and operations teams, fostering communication and promoting reliability best practices across the organization.
- Master’s degree in Computer Science, Engineering or a related subject from a Go8 university.
- Excellent problem-solving skills, including diagnosing issues within complex systems.
- Ability and desire to identify root causes of issues, and propose and implement structural improvements.
- Strong communication skills and the ability to perform well in urgent situations.
- Knowledge of the design and operation of web-based software applications built on technologies such as Node.js, PostgreSQL or Elasticsearch.
- Knowledge of modern infrastructure and operational tooling within cloud-based architectures, such as Linux, Python, AWS, Ansible and Prometheus.