96 Site Reliability Engineer jobs in Saudi Arabia
Site Reliability Engineer
Posted today
Job Description
We are looking for a Site Reliability Engineer (SRE) with deep experience in Application Performance Management (APM) tools like Dynatrace and AppDynamics. In this role, you will play a crucial part in maintaining and enhancing the reliability, performance, and scalability of our applications and infrastructure.
You will collaborate closely with development and operations teams to build robust systems, implement automation, and proactively manage performance and availability through modern observability and incident management practices.
Key Responsibilities
- System Reliability & Performance: Design and manage high-availability, scalable infrastructure.
- APM Tooling: Leverage tools such as Dynatrace, AppDynamics, and New Relic to monitor and optimize application performance.
- Incident Management: Respond to incidents, conduct root cause analysis, and implement long-term fixes.
- Automation: Build automation frameworks and scripts to streamline deployment, monitoring, and operations.
- Monitoring & Alerting: Develop and maintain robust observability stacks to track system health and performance.
- Cross-Team Collaboration: Work with developers and product teams to ensure reliability of new features and releases.
- Capacity Planning: Analyze usage trends to plan for future scaling needs.
- Documentation: Maintain up-to-date documentation of systems, processes, and infrastructure.
- Continuous Improvement: Drive initiatives that improve reliability, scalability, and security.
Qualifications
- Education: Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience).
- Experience: 5+ years in a Site Reliability Engineer or DevOps role.
Technical Skills
- Hands-on experience with APM tools (Dynatrace, AppDynamics, New Relic).
- Experience with public cloud platforms: AWS, GCP, or Azure.
- Strong scripting skills (Python, Bash, etc.).
- Familiarity with configuration management (Ansible, Puppet, or Chef).
- Solid understanding of containerization and orchestration (Docker, Kubernetes).
- Proficiency in monitoring/logging tools: Prometheus, Grafana, ELK stack.
- Strong understanding of networking concepts and protocols.
Site Reliability Engineer
Posted today
Job Description
We are seeking an experienced Site Reliability Engineer (SRE) with expertise in Application Performance Management (APM) tools such as Dynatrace and AppDynamics. The ideal candidate will focus on ensuring the reliability, availability, and performance of our applications and systems. You will collaborate with development and operations teams to build and maintain scalable infrastructure, automate processes, and implement robust monitoring and alerting solutions.
Key Responsibilities:
System Reliability and Performance: Design, implement, and maintain high-availability infrastructure to support our applications and services.
Application Performance Management: Utilize APM tools like Dynatrace and AppDynamics to monitor and optimize application performance.
Incident Management: Respond to and resolve incidents, conduct post-mortems, and implement solutions to prevent future issues.
Automation: Develop and implement automation tools and frameworks to enhance operational efficiency.
Monitoring and Alerting: Create and maintain comprehensive monitoring and alerting systems to ensure the health and performance of applications and infrastructure.
Collaboration: Work with development teams to ensure the reliability and performance of new features and releases.
Capacity Planning: Analyze current infrastructure usage and plan for future capacity needs.
Documentation: Maintain detailed and accurate documentation of system architecture, processes, and procedures.
Continuous Improvement: Identify and implement best practices for scalability, reliability, and security.
Qualifications:
Education: Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
Experience: 5+ years of experience in a Site Reliability Engineer or DevOps role.
Location: Onsite, Arabic speaking preferred
Technical Skills:
Proficiency with APM tools such as Dynatrace, AppDynamics, New Relic.
Experience with cloud platforms (AWS, GCP, Azure).
Strong scripting and automation skills (Python, Bash, etc.).
Familiarity with configuration management tools (Ansible, Puppet, Chef).
Knowledge of containerization and orchestration (Docker, Kubernetes).
Understanding of monitoring and logging tools (Prometheus, Grafana, ELK stack).
In-depth knowledge of networking concepts and protocols.
Soft Skills:
Excellent problem-solving and analytical skills.
Strong communication and collaboration abilities.
Ability to work in a fast-paced, dynamic environment.
Site Reliability Engineer
Posted today
Job Description
About the Role
We are looking for a Site Reliability Engineer (SRE) to join our engineering team and help design, build, and maintain reliable, scalable, and high-performing systems. You will work closely with software engineers, DevOps, and infrastructure teams to ensure smooth operations, automation, and continuous improvement of our production environments.
Key Responsibilities
- Design, build, and maintain scalable, reliable, and secure infrastructure.
- Monitor production systems and proactively resolve performance or reliability issues.
- Automate repetitive tasks using scripting and modern DevOps tools.
- Develop and maintain CI/CD pipelines for rapid and safe deployments.
- Implement observability best practices (logging, monitoring, alerting, tracing).
- Participate in on-call rotations, incident response, and post-mortem analysis.
- Collaborate with developers to design systems that are resilient and easy to operate.
- Ensure compliance with security, availability, and disaster recovery standards.
Required Skills & Qualifications
- Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience).
- 3+ years of experience as an SRE, DevOps, or related engineering role.
- Strong knowledge of Linux/Unix systems administration.
- Proficiency with cloud platforms (AWS, GCP, or Azure).
- Experience with containerization and orchestration (Docker, Kubernetes).
- Familiarity with IaC tools (Terraform, Ansible, or similar).
- Strong scripting skills (Python, Bash, Go, or similar).
- Hands-on experience with monitoring tools (Prometheus, Grafana, ELK, Datadog, etc.).
- Understanding of networking concepts, load balancing, DNS, and security best practices.
Job Type: Full-time
Pay: ﷼15, ﷼20,000.00 per month
Site Reliability Engineer
Posted today
Job Description
Job Title: Site Reliability Engineer
Company: Testcrew
Status: Open
Designation: Site Reliability Engineer
Department: Cloud & SRE Services - CRE
Years of Experience: 3
Educational Background: Bachelor's Degree (4-5 year degree)
Job Fulfillment Deadline:
We are seeking an experienced Site Reliability Engineer (SRE) with expertise in Application Performance Management (APM) tools such as Dynatrace and AppDynamics. The ideal candidate will focus on ensuring the reliability, availability, and performance of our applications and systems. You will collaborate with development and operations teams to build and maintain scalable infrastructure, automate processes, and implement robust monitoring and alerting solutions.
Key Responsibilities:
System Reliability and Performance: Design, implement, and maintain high-availability infrastructure to support our applications and services.
Application Performance Management: Utilize APM tools like Dynatrace and AppDynamics to monitor and optimize application performance.
Incident Management: Respond to and resolve incidents, conduct post-mortems, and implement solutions to prevent future issues.
Automation: Develop and implement automation tools and frameworks to enhance operational efficiency.
Monitoring and Alerting: Create and maintain comprehensive monitoring and alerting systems to ensure the health and performance of applications and infrastructure.
Collaboration: Work with development teams to ensure the reliability and performance of new features and releases.
Capacity Planning: Analyze current infrastructure usage and plan for future capacity needs.
Documentation: Maintain detailed and accurate documentation of system architecture, processes, and procedures.
Continuous Improvement: Identify and implement best practices for scalability, reliability, and security.
Qualifications:
Education: Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
Experience: 5+ years of experience in a Site Reliability Engineer or DevOps role.
Location: Onsite, Arabic speaking preferred
Technical Skills:
Proficiency with APM tools such as Dynatrace, AppDynamics, New Relic.
Experience with cloud platforms (AWS, GCP, Azure).
Strong scripting and automation skills (Python, Bash, etc.).
Familiarity with configuration management tools (Ansible, Puppet, Chef).
Knowledge of containerization and orchestration (Docker, Kubernetes).
Understanding of monitoring and logging tools (Prometheus, Grafana, ELK stack).
In-depth knowledge of networking concepts and protocols.
Soft Skills:
Excellent problem-solving and analytical skills.
Strong communication and collaboration abilities.
Ability to work in a fast-paced, dynamic environment.
Skills Required:
- Application performance monitoring
Project: PROJ-0124
Talent Requisition ID: JR
Site Reliability Engineer
Posted today
Job Description
Site Reliability Engineer
We are looking for a highly skilled Site Reliability Engineer with strong expertise in Kubernetes, ELK stack, OpenStack, and Linux. The role involves building and maintaining reliable infrastructure, ensuring observability, and supporting production environments with automation and strong troubleshooting skills.
Key Responsibilities
- Deploy, manage, and scale Kubernetes clusters and containerized workloads.
- Design and implement logging and monitoring solutions using the ELK stack (Elasticsearch, Logstash, Kibana).
- Operate and maintain OpenStack cloud environments.
- Perform advanced Linux administration (troubleshooting, patching, optimization).
- Automate infrastructure provisioning and management using Terraform, Helm, and Ansible.
- Build and maintain CI/CD pipelines to support development and production deployments.
- Ensure system reliability, scalability, performance, and security across environments.
- Participate in on-call rotations, providing incident response and root cause analysis.
Skills and Competencies
- Proven hands-on experience with Kubernetes and Docker.
- Strong knowledge of the ELK stack (Elasticsearch, Logstash, Kibana).
- Expertise in managing OpenStack environments (STC OpenStack cloud preferred).
- Advanced proficiency in Linux administration (RHEL, Ubuntu).
- Strong scripting skills in Python and Bash.
- Experience with CI/CD tools (GitLab preferred).
- Familiarity with cloud platforms (AWS, GCP, or Azure).
- Experience with Prometheus/Grafana for monitoring.
- Understanding of Kubernetes and Linux security hardening best practices.
Qualifications
- 6–10 years of experience in Linux and Cloud Technologies
- Bachelor's degree in Computer Science, Information Technology, or related field
Relevant Certification (preferred):
- CKA, CKS (Certified Kubernetes Administrator / Certified Kubernetes Security Specialist)
- Linux Professional Institute Certification (LPIC-2 or LPIC-3) or Red Hat Certified Engineer (RHCE)
- OpenStack Administrator Certification
- HashiCorp Terraform Associate Certification
- GCP/AWS/Azure Solutions Architect or SysOps certifications
Halian Group:
With over 28 years of experience, we have come to understand that innovation is the only way to provide agile, practical solutions that transform businesses and careers. Our resourcing and smart services help you to realize tomorrow's potential. Discover the amazing things possible when you bring the right people and the right technologies together.
At Halian, we recognize that diversity, equity, and inclusion (DEI) are essential to building high-performing teams for our clients. We are committed to connecting organizations with top talent from all backgrounds, ensuring that every individual feels valued, respected, and empowered to contribute their unique perspectives. We encourage applications from all qualified candidates, regardless of race, gender, disability, or any other characteristic that makes them unique. By fostering diverse and inclusive workplaces, we help our clients drive innovation, enhance collaboration, and better reflect the communities they serve.
Site Reliability Engineer in Riyadh, Saudi Arabia
Senior Site Reliability Engineer
Posted today
Job Description
Mozn is a rapidly growing technology firm revolutionising the field of Artificial Intelligence and Data Science, headquartered in Riyadh, Saudi Arabia. It is committed to realising Vision 2030 and has a proven track record of supporting and growing the tech ecosystem in Saudi Arabia and the GCC region. Mozn is the trusted AI technology partner for some of the largest government organizations, as well as many large corporations and startups.
We are in an exciting stage of scaling the company to provide AI-powered products and solutions both locally and globally, ensuring the growth and prosperity of our digital society. It is an exciting time to work in the field of AI to create a lasting impact.
About the role
We are looking for a Senior Site Reliability Engineer to join our team. In this role, you will help ensure our systems are operational and secure, manage our enterprise applications, and maintain our networks.
What you'll do
- Combine software engineering, system architecture design, and operational tasks.
- Participate in team meetings, sprint planning, and collaborate as part of the Engineering team on projects.
- Design, build, support, and scale cloud and on-premise infrastructure, including monitoring, alerting, and debugging.
- Implement continuous integration and deployment workflows with best practices in testing, linting, and dependency management.
- Maintain data stores, monitor load, and design backup, restore, scaling, and clustering strategies.
- Collaborate with other departments such as product and data science to address their use cases.
- Explore and adopt new technologies to enhance our current stack.
- Install and configure servers and network equipment using infrastructure as code techniques.
- Practice sustainable incident response and conduct blameless postmortems.
- BSc/BA in Computer Engineering, Computer Science, or a related field.
- At least 5 years of experience in SRE, DevOps, or infrastructure engineering roles.
- Professional certifications are a plus.
- Experience with container runtimes and orchestrators like Docker and Kubernetes.
- Experience with major cloud providers such as AWS, Azure, GCP, or Oracle.
- Proficiency in infrastructure as code languages like Python and Golang.
- Experience with Linux servers and bash scripting.
- Experience with automating deployment pipelines.
- Strong networking knowledge.
- Knowledge of big data platforms like Kafka, Hadoop, and Spark is a plus.
- Knowledge of SQL and database management is a plus.
- Experience with Terraform or Ansible is a plus.
Join us during an exciting time for the Middle East in a high-growth environment. We offer significant responsibility and trust, competitive compensation, top-tier health insurance, and a culture that empowers you to excel. Work alongside some of the greatest minds in AI in a fun and dynamic workplace that values diversity and individuality.
Senior Site Reliability Engineer
Posted today
Job Description
Who Are We
HALA is a leading fintech player in the MENAP region that aims to redefine financial services and build the future bank of SMEs. HALA aims at empowering SMEs to start, run, and grow their businesses by providing them with cutting-edge financial and technological tools.
HALA currently holds multiple entities in UAE, Saudi Arabia and Egypt (including HALA Payments, HALA Cashier and HALA Logistics) and offers solutions that enable merchants to digitize their payments as well as manage their sales and operations.
Founded in 2017, HALA is licensed by the Saudi Arabian Central Bank as well as the Financial Services Regulatory Authority (FSRA) in Abu Dhabi Global Market.
Job Summary:
Result-oriented Site Reliability Engineer with 2–4 years of experience in maintaining and improving the reliability, scalability, and performance of complex distributed systems. Proficient in using industry-leading tools and technologies such as Kubernetes, CI/CD pipelines (ArgoCD, FluxCD, Jenkins, GitLab) to optimize infrastructure and automate operational tasks. Skilled in incident management, monitoring, and deployment automation. Strong problem-solving and collaboration abilities, ensuring seamless operations for high-traffic FinTech applications at HALA.
Job Responsibilities:
- Designing and implementing scalable and reliable infrastructure, ensuring high availability and optimal performance of HALA applications.
- Collaborating with development teams to integrate reliability and resilience into the software development lifecycle.
- Conducting post-incident reviews and root cause analysis to identify areas of improvement and prevent future system failures.
- Implementing monitoring and alerting systems to proactively identify and address potential issues.
- Automating routine tasks to streamline operations and reduce manual intervention, contributing to overall system efficiency.
- Participating in on-call rotations to provide 24/7 support and quick resolution of critical incidents.
- Ensuring compliance with industry standards and best practices in system reliability and security.
What We Offer You
We believe you will love working at HALA
- We have an inclusive and diverse culture that encourages innovation and flexibility in remote, in-office, and hybrid work setups.
- We offer highly competitive compensation packages, including the potential for shares.
- We prioritize personal development and offer regular training and an annual learning stipend to tackle new challenges and grow your career in a hyper-growth environment.
- Join a talented team of over 30 nationalities working in 7 countries and gain valuable experience in an exciting industry.
- We offer autonomy, mentoring, and challenging goals that create incredible opportunities for both you and the company.
- You will be given a lot of responsibility and trust. We believe that the best results come when the people responsible for a function are given the freedom to do what they think is best.
If you think you have what it takes to join a remarkable team #apply_now
Senior Site Reliability Engineer
Posted today
Job Description
About Mozn
Mozn is a rapidly growing technology firm revolutionising the field of Artificial Intelligence and Data Science, headquartered in Riyadh, Saudi Arabia. It is working to realise Vision 2030 and has a proven track record of excellence in supporting and growing the tech ecosystem in Saudi Arabia and the GCC region. Mozn is the trusted AI technology partner for some of the largest government organizations, as well as many large corporations and startups.
We are in an exciting stage of scaling the company to provide AI-powered products and solutions both locally and globally that ensure the growth and prosperity of our digital humanity. It is an exciting time to work in the field of AI to create a long-lasting impact.
About The Role
- We are hiring a Senior Site Reliability Engineer (SRE) to ensure the reliability, scalability, and efficiency of AI-powered systems for our clients. This role focuses on building resilient infrastructure across cloud and on-prem environments, with expertise in containerization, virtualization, and system automation.
What You'll Do
- Designing and implementing reliable infrastructures to support AI workloads across cloud and on-prem environments
- Managing cloud deployments, VM setups, and on-premises infrastructure build-outs
- Deploying and maintaining containerized applications using Docker and Kubernetes
- Automating infrastructure provisioning using tools such as Terraform or Ansible
- Developing and managing observability frameworks for monitoring, logging, and alerting
- Ensuring system security, scalability, and compliance with Saudi regulations
- Collaborating with software and data engineering teams to build fault-tolerant AI solutions
- Leading incident response and root cause analysis, and establishing SRE standards
Qualifications
- Bachelor's or Master's degree in Computer Science, IT, or related field
- 4+ years of experience in site reliability engineering, DevOps, or infrastructure engineering
- Strong experience in Saudi Arabia, delivering consultation services across public and private sectors
- Expertise in cloud platforms (AWS, GCP, OCI, Azure) and hybrid deployments.
- Proficiency in Docker, Kubernetes, and VM setup
- Hands-on experience with on-premises infrastructure setup and optimization
- Familiarity with monitoring platforms (Prometheus, Grafana, ELK, Datadog, etc.)
- Strong scripting skills (Python, Bash, or similar) for automation
- Excellent stakeholder communication and cross-team collaboration skills
Benefits
- You will be at the forefront of an exciting time for the Middle East, joining a high-growth rocket-ship in an exciting space
- You will be given a lot of responsibility and trust. We believe that the best results come when the people responsible for a function are given the freedom to do what they think is best
- The fundamentals will be taken care of: competitive compensation, top-tier health insurance, and an enabling culture so that you can focus on what you do best
- You will enjoy a fun and dynamic workplace working alongside some of the greatest minds in AI
- We believe strength lies in difference, embracing all for who they are and empowering them to be the best version of themselves
Senior Site Reliability Engineer
Posted today
Job Description
We are looking for a Senior Site Reliability Engineer (SRE) to help design, scale, and secure our rapidly growing platform infrastructure.
You will work across all critical systems — from customer-facing applications and APIs to internal platforms and data services — ensuring availability, performance, and cost efficiency at scale.
You'll be hands-on with Kubernetes, observability, GitOps, automation, and cloud infrastructure, while partnering closely with application, platform, and data teams to deliver a highly reliable and self-healing environment.
This role is ideal for an engineer who thrives on complex distributed systems, loves to automate everything, and can balance speed, stability, and cost-efficiency in production.
- Bachelor's degree in Computer Science, Engineering, or a related field — or equivalent work experience.
- Design, deploy, monitor, and maintain production workloads across Kubernetes (EKS/AKS/GKE) clusters.
- Build self-healing, auto-scaling systems that minimize manual intervention and ensure uptime.
- Design and operate reliable database and storage platforms (SQL, NoSQL, and object stores) within Kubernetes environments.
- Implement backup, disaster recovery, replication, and failover strategies to meet RPO/RTO targets.
- Troubleshoot and recover Kubernetes Persistent Volumes (StorageClasses, CSI drivers, PVC issues).
- Optimize storage performance and cost through multi-tier strategies, hot/cold data separation, and S3/offloading lifecycle policies.
- Secure and scale object storage platforms (e.g., MinIO/S3-compatible) for high-throughput data pipelines.
- Manage block storage (EBS/io2/gp3) and shared file systems (EFS, NFS) for resilience and cost balance.
- Collaborate with teams to optimize networking, ingress/egress traffic, and service mesh for secure communication.
- Build self-healing, auto-scaling systems that minimize toil and manual intervention.
- Optimize networking, ingress/egress traffic control, and service mesh for secure & performant communication.
- Design and operate reliable database and storage platforms (SQL, NoSQL, and object stores) in Kubernetes environments.
- Own backup, disaster recovery, replication, and failover strategies to meet RPO/RTO targets for critical data services.
- Troubleshoot and recover Kubernetes Persistent Volumes confidently during incidents (StorageClasses, CSI drivers, PVC issues).
- Secure and scale object storage platforms (e.g., MinIO/S3-compatible) and integrate with workloads for high-throughput data pipelines.
- Work with block storage (EBS/io2/gp3) and shared file systems (EFS, NFS) to balance performance, resiliency, and cost.
- Champion GitOps and CI/CD best practices (ArgoCD, Flux, GitHub Actions).
- Build automation for infrastructure provisioning and upgrades using Terraform, Helm, and Kubernetes Operators.
- Reduce release risk through progressive delivery strategies (blue/green, canary, spot instance rolling updates).
- Own the monitoring and alerting stack (Prometheus, Grafana, Loki, VictoriaMetrics, OpenSearch).
- Lead incident management and postmortems to prevent recurrence.
- Provide real-time visibility into system health, performance, and cost metrics.
- Implement least-privilege IAM policies, secure service-to-service communication, and network ACLs/firewalls.
- Enforce Kubernetes RBAC, secret management, and secure image supply chain.
- Participate in audit readiness and compliance efforts.
- Analyze and tune system performance under scale (CPU/memory/IO).
- Partner with product and platform teams to right-size clusters, databases, and storage tiers.
- Introduce cost visibility dashboards for engineering leadership.
Preferred Qualifications
- Experience managing mission-critical systems at scale (high traffic, multi-region).
- Proven cost optimization in cloud/K8s environments.
- Familiarity with service mesh (Istio, Linkerd) or advanced networking/egress control.
- Experience with data platform components (Airflow, Debezium, ClickHouse, etc.) is a plus but not required.
- Strong communication and teamwork skills; able to collaborate across engineering, DevOps, security, and product teams.
Requirements
- 8+ years in SRE / DevOps / Infrastructure Engineering roles.
- Deep Kubernetes expertise (multi-cluster, Helm chart development, advanced networking).
- Strong GitOps workflows using ArgoCD/Flux.
- Expertise with AWS (preferred) or Azure/GCP, plus Infrastructure-as-Code (Terraform, Pulumi, CloudFormation).
- Advanced knowledge of SQL & NoSQL databases (MySQL/Aurora, PostgreSQL, MongoDB, Redis).
- Scripting/automation skills in Python, Bash, or Go.
- Solid background in monitoring/observability (Prometheus, Grafana, Loki, ELK/OpenSearch, VictoriaMetrics).
- Experience with CI/CD at scale and managing production incidents.
- Experience with streaming/messaging (Kafka, RabbitMQ, or similar).
- Comprehensive Training & Development programs.
- Performance-based Bonus incentives.
- Flexible Work From Home options.
Senior Site Reliability Engineer
Posted today
Job Description
We are seeking an experienced Site Reliability Engineer (SRE) with expertise in Application Performance Management (APM) tools such as Dynatrace and AppDynamics. The ideal candidate will focus on ensuring the reliability, availability, and performance of our applications and systems. You will collaborate with development and operations teams to build and maintain scalable infrastructure, automate processes, and implement robust monitoring and alerting solutions.
Key Responsibilities:
- System Reliability and Performance: Design, implement, and maintain high-availability infrastructure to support our applications and services.
- Application Performance Management: Utilize APM tools like Dynatrace and AppDynamics to monitor and optimize application performance.
- Incident Management: Respond to and resolve incidents, conduct post-mortems, and implement solutions to prevent future issues.
- Automation: Develop and implement automation tools and frameworks to enhance operational efficiency.
- Monitoring and Alerting: Create and maintain comprehensive monitoring and alerting systems to ensure the health and performance of applications and infrastructure.
- Collaboration: Work with development teams to ensure the reliability and performance of new features and releases.
- Capacity Planning: Analyze current infrastructure usage and plan for future capacity needs.
- Documentation: Maintain detailed and accurate documentation of system architecture, processes, and procedures.
- Continuous Improvement: Identify and implement best practices for scalability, reliability, and security.
Qualifications:
- Education: Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
- Excellent English & Arabic Communication skills.
- Experience: 5+ years of experience in a Site Reliability Engineer or DevOps role.
Technical Skills:
- Proficiency with APM tools such as Dynatrace, AppDynamics, New Relic.
- Experience with cloud platforms (AWS, GCP, Azure).
- Strong scripting and automation skills (Python, Bash, etc.).
- Familiarity with configuration management tools (Ansible, Puppet, Chef).
- Knowledge of containerization and orchestration (Docker, Kubernetes).
- Understanding of monitoring and logging tools (Prometheus, Grafana, ELK stack).
- In-depth knowledge of networking concepts and protocols.
Soft Skills:
- Excellent problem-solving and analytical skills.
- Strong communication and collaboration abilities.
- Ability to work in a fast-paced, dynamic environment.