Teleperformance

Site Reliability Engineering (SRE) - BPO - Kuala Lumpur

Job Locations MY-Kuala Lumpur-Kuala Lumpur
Requisition Post Information: Posted Date 5/15/2026 8:29 AM
Requisition ID
2026-82241
Category
Information Technology
Country
Malaysia

Overview

Site Reliability Engineering (SRE) combines software and systems engineering with the art of machine learning to build and run large-scale, massively distributed, and fault-tolerant systems. You will have the opportunity to sharpen your expertise in coding, performance analysis, and large-scale system design while making a tangible impact on the future of our organization's infrastructure services and AML systems.

Qualifications

Preferred Skills
•Experience with containers and container orchestration platforms such as Docker and Kubernetes.
•Proficiency in or exposure to machine learning frameworks such as TensorFlow, PyTorch, MXNet, or PaddlePaddle.
•Hands-on experience with monitoring tools and methodologies (e.g., Prometheus, Grafana).
•Soft Skills: Strategic thinking, exceptional communication, and the ability to collaborate effectively with cross-functional teams in a fast-paced environment.

Technical Requirements
•Coding: Proficient in at least one high-level programming language (e.g., Python, Go, C++, or Java) and shell scripting. Strong understanding of data structures and algorithms.
•Systems: Strong understanding of Linux operating systems, open-source technologies, and network architecture.
•Databases: Competent knowledge of relational database systems and database modeling.

Minimum Skills
•Education: Bachelor’s or master’s degree in Computer Science, Information Technology, Computer Engineering, or a related field.
•Experience: 3+ years of experience as a Site Reliability Engineer, Systems Engineer, or Software Engineer.

Responsibilities

Key Responsibilities and Accountabilities
•Design, build, and maintain highly available, scalable, and fault-tolerant systems. Collaborate with software engineering teams to ensure applications are designed with reliability and performance in mind.
•Develop and maintain automation procedures to maximize system efficiency, minimize human intervention, and optimize routine tasks.
•Monitor and analyze system performance to identify and address bottlenecks before they impact users. Ensure the infrastructure can handle rapid growth in web traffic and ML data processing.

Main Job Requirements
•Participate in 24/7 on-call rotations (including scheduled shifts and holidays). Practice sustainable on-call response, conduct root-cause analysis, and lead blameless post-mortems to prevent recurrence.
•Define and track service-level indicators, objectives, and agreements (SLIs/SLOs/SLAs), and implement monitoring tools, automated alerting, and metrics to track system health and performance.
•Implement and maintain security best practices and ensure all systems meet regulatory requirements.
