28 Site Reliability Engineer jobs in Saudi Arabia
Senior Site Reliability Engineer
Posted 2 days ago
Job Description
HALA is a leading fintech player in the MENAP region that aims to redefine financial services and build the future bank of SMEs. HALA aims at empowering SMEs to start, run, and grow their businesses by providing them with cutting-edge financial and technological tools.
HALA currently holds multiple entities in UAE, Saudi Arabia and Egypt (including HALA Payments, HALA Cashier and HALA Logistics) and offers solutions that enable merchants to digitize their payments as well as manage their sales and operations.
Founded in 2017, HALA is licensed by the Saudi Arabian Central Bank as well as the Financial Services Regulatory Authority (FSRA) in Abu Dhabi Global Market.
Responsibilities:
- Comply with HALA’s code of conduct and ethics
- Promote HALA’s vision, mission, and values, and model desired behaviors
- Promote HALA and spread its culture
- Commit to HALA’s rules and regulations
- Perform tasks as directed in the pursuit of the achievement of organizational goals
- Share know-how with the team and encourage their development
Job Specific:
- Run the cloud environment by monitoring availability and taking a holistic view of system health
- Build software and systems to manage platform infrastructure and applications
- Improve reliability, quality, and time-to-market of our suite of software solutions
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
- Provide primary operational support and engineering for multiple large, distributed software applications
- Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
- Partner with development teams to improve services through rigorous testing and release procedures
- Participate in system design consulting, platform management, and capacity planning
- Create sustainable systems and services through automation and uplifts
- Balance feature development speed and reliability with well-defined service level objectives
- Deploy updates and fixes
- Build tools to reduce occurrences of errors and improve customer experience
- Perform root cause analysis for production errors
- Investigate and resolve technical issues
- Design procedures for system troubleshooting and maintenance
Education:
Bachelor’s degree in computer science, information technology, or equivalent field of studies
Years of experience can substitute for the education requirement
Experience:
5-7 years of experience in a similar position (SRE, DevOps, or infrastructure engineer).
Skills:
- Computer Skills: Advanced in Microsoft Office Tools
- Languages: Fluent in English and Arabic
- Advanced knowledge of compliance and regulations
- Experience with Kubernetes administration.
- Experience with infrastructure as code tools such as Terraform and Ansible.
- Experience with at least one of the major cloud providers: AWS, GCP, Azure, or OCI.
- Experience with architecting, developing, and troubleshooting large-scale systems.
- Experience building CI/CD pipelines (preferably GitOps).
- Experience with monitoring and observability tools such as Prometheus, Loki, Jaeger, and Sentry.
- Experience managing databases such as PostgreSQL and MongoDB, including backup and restore plans, replication, and clustering.
- Good networking knowledge (preferably experience with VPNs and Service Mesh)
We believe you will love working at HALA!
- We have an inclusive and diverse culture that encourages innovation and flexibility in remote, in-office, and hybrid work setups.
- We offer highly competitive compensation packages, including the potential for shares.
- We prioritize personal development and offer regular training and an annual learning stipend to tackle new challenges and grow your career in a hyper-growth environment.
- Join a talented team of over 30 nationalities working in 7 countries and gain valuable experience in an exciting industry.
- We offer autonomy, mentoring, and challenging goals that create incredible opportunities for both you and the company.
- You will be given a lot of responsibility and trust. We believe that the best results come when the people responsible for a function are given the freedom to do what they think is best.
Staff Site Reliability Engineer
Posted 2 days ago
Job Description
Who Are We
We are Foodics! A leading restaurant management ecosystem and payment tech provider. Founded in 2014 with headquarters in Riyadh and offices across 5 countries, including UAE, Egypt, Jordan, and Kuwait. We serve customers and partners in over 35 countries worldwide. Our innovative products have processed over 6 billion orders, making Foodics one of the most rapidly evolving SaaS companies from the MENA region. Foodics has achieved three funding rounds, with the latest raising $170 million in the largest SaaS funding round in MENA, enhancing our capabilities to serve business owners better.
The Job in a Nutshell
We are seeking a Staff Site Reliability Engineer (SRE) to join our high-impact engineering team. You will ensure the scalability, performance, and reliability of Foodics’ cloud-native platforms. Your role involves designing, implementing, and evolving infrastructure solutions and operational processes supporting millions of transactions daily, while promoting best practices in observability, incident management, and resilience engineering.
What Will You Do
- Design and maintain scalable, highly available, and fault-tolerant systems across cloud providers (AWS, OCI).
- Lead incident response efforts, conduct blameless post-mortems, and drive improvements.
- Build and refine automated deployment pipelines for safe and repeatable changes.
- Implement observability frameworks (metrics, tracing, logging) to detect and resolve performance issues proactively.
- Collaborate with development teams to embed reliability into the software lifecycle.
- Optimize infrastructure costs while maintaining service quality.
- Drive chaos engineering experiments to validate system resilience.
- Document architecture, runbooks, and operational processes for internal and cross-team use.
What Are We Looking For
We seek a reliability-focused engineer with strong technical skills, experienced in solving operational challenges at scale. You should be hands-on with distributed systems, cloud-native platforms, and automation tools.
- Strong knowledge of SRE principles (SLIs, SLOs, SLAs) and operational excellence.
- Experience with Kubernetes, container orchestration, and service mesh technologies.
- Expertise in infrastructure as code (Terraform, Ansible; Crossplane is optional) and scripting (Bash, Python, Go).
- Deep understanding of monitoring and alerting systems (Prometheus / Grafana, ELK, Loki, Datadog, AWS CloudWatch).
- Skills in cloud networking, load balancing, and API gateways (NGINX, Kong, AWS API Gateway).
- Experience with relational and NoSQL databases (MySQL, PostgreSQL, MongoDB, DocumentDB, Redis).
- Familiarity with distributed tracing (Jaeger, OpenTelemetry) and chaos testing frameworks.
- Excellent troubleshooting skills, capable of resolving high-impact incidents under pressure.
Who Will Excel
- Candidates with experience operating high-traffic, mission-critical cloud-native platforms.
- Those demonstrating strong collaboration and communication skills across teams.
- Individuals with a data-driven approach to performance tuning and capacity planning.
- Candidates thriving in fast-paced, high-growth SaaS environments and committed to continuous improvement.
What We Offer You
We believe you will love working at Foodics!
- Competitive compensation packages, including bonuses and potential equity.
- Annual learning stipend and regular training opportunities.
- Exposure to cutting-edge cloud technologies and distributed systems.
- A diverse, global team of over 30 nationalities in 14 countries.
- Autonomy, challenging goals, and the opportunity to impact platform reliability serving millions.
Staff Site Reliability Engineer
Posted 9 days ago
Job Description
Who Are We
We are Foodics, a leading restaurant management ecosystem and payment tech provider, founded in 2014 with headquarters in Riyadh and offices across 5 countries, including UAE, Egypt, Jordan, and Kuwait. We currently serve customers and partners in over 35 countries worldwide. Our innovative products have successfully processed over 6 billion (yes, billion with a B) orders so far, making Foodics one of the most rapidly evolving SaaS companies to ever emerge from the MENA region. Foodics has also completed three rounds of funding, with the latest raising $170 million in the largest SaaS funding round in MENA, boosting its innovation capabilities to better serve business owners.
The Job in a Nutshell
We are seeking a Staff Site Reliability Engineer (SRE) to join our high-impact engineering team. In this role, you will be responsible for ensuring the scalability, performance, and reliability of Foodics’ cloud-native platforms and services. You will design, implement, and evolve infrastructure solutions and operational processes that support millions of transactions daily, while championing best practices in observability, incident management, and resilience engineering. Your expertise will help us maintain world-class uptime and seamless customer experiences as we continue to grow at scale.
What Will You Do
- Design and maintain scalable, highly available, and fault-tolerant systems across multiple cloud providers (AWS, OCI).
- Lead incident response efforts, conducting blameless post-mortems and driving systemic improvements.
- Build and refine automated deployment pipelines, ensuring fast, safe, and repeatable delivery of changes.
- Implement robust observability frameworks (metrics, tracing, logging) to proactively detect and address performance issues.
- Collaborate with development teams to embed reliability into every stage of the software lifecycle.
- Optimize infrastructure costs while maintaining service quality.
- Drive chaos engineering experiments to validate system resilience.
- Document architecture, runbooks, and operational processes for internal and cross-team use.
What Are We Looking For
We’re looking for a reliability-focused engineer with strong technical depth, who thrives in solving complex operational challenges at scale. You must be hands-on with distributed systems, cloud-native platforms, and automation tools.
- Strong background in SRE principles (SLIs, SLOs, SLAs) and operational excellence.
- Experience with Kubernetes, container orchestration, and service mesh technologies.
- Proven expertise in infrastructure as code (Terraform, Ansible; Crossplane is optional) and automation scripting (Bash, Python, Go).
- Deep understanding of monitoring and alerting systems (Prometheus/Grafana, ELK, Loki, Datadog, AWS CloudWatch).
- Skilled in cloud networking, load balancing, and API gateway management (NGINX, Kong, AWS API Gateway).
- Solid experience with relational and NoSQL databases in production (MySQL/PostgreSQL, MongoDB, DocumentDB, Redis).
- Familiarity with distributed tracing (Jaeger, OpenTelemetry) and chaos testing frameworks.
- Excellent troubleshooting skills and ability to resolve high-impact incidents under pressure.
Who Will Excel
- Candidates who have successfully operated high-traffic, mission-critical platforms in a cloud-native environment.
- Candidates who demonstrate strong collaboration and communication skills across engineering, product, and business teams.
- Candidates who bring a data-driven approach to performance tuning and capacity planning.
- Candidates who thrive in fast-paced, high-growth SaaS environments and embrace continuous improvement.
What We Offer You
We believe you will love working at Foodics!
- Highly competitive compensation packages, including bonuses and potential equity.
- Annual learning stipend and regular training to accelerate your career.
- Exposure to cutting-edge cloud technologies and large-scale distributed systems.
- A truly global team of over 30 nationalities in 14 countries.
- Autonomy, challenging goals, and the chance to directly impact the reliability of platforms serving millions.
Senior Site Reliability Engineer
Posted 9 days ago
Job Description
Canonical is a leading provider of open source software and operating systems to the global enterprise and technology markets. Our platform, Ubuntu, is very widely used in breakthrough enterprise initiatives such as public cloud, data science, AI, engineering innovation and IoT. Our customers include the world's leading public cloud and silicon providers, and industry leaders in many sectors. The company is a pioneer of global distributed collaboration, with 1200+ colleagues in 75+ countries and very few office based roles. Teams meet two to four times yearly in person, in interesting locations around the world, to align on strategy and execution.
The company is founder led, profitable and growing.
We are hiring a Senior Site Reliability Engineer
Next-gen operations at scale, with pure Python infra-as-code, from bare metal to containers and applications. Our goal is to perfect enterprise infrastructure devops.
We run hundreds of private cloud, Kubernetes, and application clusters for customers across physical and public cloud estate, and we are raising the bar on what's possible with automation by embracing a universal operator pattern and model-driven operations.
To succeed in this role you need to believe in automation as a pure software engineering problem, not a hack-it-till-it-works-for-me problem. You need to be interested in the scientific approach to operations at scale, driven by metrics and code, and you need to be able to learn the entire stack, from bare metal networking and kernel up to serverless and open source applications.
Location: Globally remote role
The role entails
Our cloud operations engineers bring Python software-engineering skills and rigour to the operations domain. We practise devsecops from bare metal to application. We architect and run OpenStack, Kubernetes and software defined storage, and we enable devsecops for applications running on that infrastructure too.
To become a member of this team, you need to be a software engineer fluent in Python, you need a genuine interest in the full open source infrastructure stack from metal to containers, and you need the ability to work in a high pressure operations environment with mission-critical services for global brand name customers.
As a member of the team you will gain experience in a broad range of cloud technologies. We evolve our offerings as the state of the art improves, so you get to stay current with the latest capabilities in open source infrastructure. We drive upgrades to keep our customers on the latest, best solutions.
What we are looking for in you
- Degree in Software Engineering or Computer Science
- Experience with Linux and familiarity with Linux networking and storage
- Python software development expertise
- Operational experience
- Excellent interpersonal skills, curiosity, flexibility, and accountability
- Ability to travel internationally twice a year, for company events up to two weeks long
- Experience with OpenStack or Kubernetes deployment or operations
- Familiarity with public or private cloud management
We consider geographical location, experience, and performance in shaping compensation worldwide. We revisit compensation annually (and more often for graduates and associates) to ensure we recognise outstanding performance. In addition to base pay, we offer a performance-driven annual bonus or commission. We provide all team members with additional benefits, which reflect our values and ideals. We balance our programs to meet local needs and ensure fairness globally.
- Distributed work environment with twice-yearly team sprints in person
- Personal learning and development budget of USD 2,000 per year
- Annual compensation review
- Recognition rewards
- Annual holiday leave
- Maternity and paternity leave
- Employee Assistance Programme
- Opportunity to travel to new locations to meet colleagues
- Priority Pass, and travel upgrades for long haul company events
Canonical is a pioneering tech firm at the forefront of the global move to open source. As the company that publishes Ubuntu, one of the most important open source projects and the platform for AI, IoT and the cloud, we are changing the world of software. We recruit on a global basis and set a very high standard for people joining the company. We expect excellence - in order to succeed, we need to be the best at what we do. Most colleagues at Canonical have worked from home since its inception in 2004. Working here is a step into the future, and will challenge you to think differently, work smarter, learn new skills, and raise your game.
Canonical is an equal opportunity employer
We are proud to foster a workplace free from discrimination. Diversity of experience, perspectives, and background create a better work environment and better products. Whatever your identity, we will give your application fair consideration.
- Seniority level: Mid-Senior level
- Employment type: Full-time
- Job function: Engineering and Information Technology
- Industries: Software Development
Site Reliability Engineer (SRE)
Posted 12 days ago
Job Description
Site Reliability Engineer (SRE) role at Dicetek LLC.
The Site Reliability Engineer (SRE) is responsible for implementing and maintaining highly reliable and scalable applications and services. The primary goal of an SRE is to ensure smooth operation and performance of applications, minimizing downtime and maximizing user experience.
Requirements include:
- 4+ years of relevant experience in platform monitoring, application performance monitoring, problem solving, incident response, troubleshooting, and post-incident analysis.
Key responsibilities:
- Automation and Tooling: Develop and maintain automation tools, scripts, and frameworks to streamline system deployment, configuration management, monitoring, and incident response. Automate repetitive tasks to minimize manual intervention.
- Monitoring and Alerting: Implement monitoring solutions to proactively detect issues. Set up dashboards and alerts for system health, performance, and availability.
- Incident Response and Troubleshooting: Participate in incident management, conduct root cause analysis, and collaborate with teams to resolve issues efficiently.
- Performance Optimization: Identify bottlenecks and work with development teams to improve application performance.
- Security and Compliance: Collaborate with security teams to implement controls, ensure compliance, and perform security audits.
- Collaboration and Documentation: Foster cross-team collaboration and document system designs, configurations, and procedures.
- Continuous Improvement: Stay updated with industry trends and drive initiatives to enhance system reliability and scalability.
Seniority level: Not Applicable
Employment type: Contract
Job function: Engineering and Information Technology
Industries: IT Services and IT Consulting
Site Reliability Engineer (SRE)
Posted 12 days ago
Job Description
SRE (strong experience in Vulnerability Management and Server Patching)
Location: Seattle, WA (onsite 3 days a week)
Duration: 12 months (full-time)
Core skills: SRE, Vulnerability Management, and Server Patching
- This role focuses on ensuring the security and reliability of server infrastructure through proactive vulnerability management and efficient server patching, leveraging the Brinqa platform.
- The SRE will play a critical role in identifying, prioritizing, and remediating vulnerabilities, as well as coordinating and executing server patching activities.
Key responsibilities:
- Identifying and analyzing vulnerabilities within server infrastructure, including operating systems, applications, and network devices, potentially using tools like Qualys.
- Collaborating with application owners and other teams to define and maintain patch maintenance windows.
- Investigating and resolving issues related to server patching and vulnerability remediation, including coordinating with other teams and escalating issues as necessary.
- Ensuring compliance with corporate security policies, industry standards, and regulations related to vulnerability management and patching.
- Generating reports on vulnerability status, remediation progress, and overall security posture using tools like Brinqa and Qualys.
- Staying up-to-date on the latest vulnerabilities, security threats, and patching best practices.
- Defining and assessing patching service levels and success rates.
Flexible work from home options available.
Senior Site Reliability Engineer
Posted 19 days ago
Job Description
About Mozn
Mozn is a rapidly growing technology firm revolutionising the field of Artificial Intelligence and Data Science, headquartered in Riyadh, Saudi Arabia. It is working to realise Vision 2030, with a proven track record of excellence in supporting and growing the tech ecosystem in Saudi Arabia and the GCC region. Mozn is the trusted AI technology partner for some of the largest government organizations, as well as many large corporations and startups.
We are in an exciting stage of scaling the company to provide AI-powered products and solutions both locally and globally that ensure the growth and prosperity of our digital humanity. It is an exciting time to work in the field of AI to create a long-lasting impact.
About The Role
We are looking for a Senior Site Reliability Engineer to join our team. In this role, you’ll help ensure our systems are running and secure, manage our various enterprise applications, and maintain our networks.
What You'll Do
- What you do will be a mixture of software engineering, system architecture design, and operations.
- You will be part of the Engineering team working on a project. You will attend morning meetings and sprint planning as an SRE member of the team.
- You will help design, build, support, and scale our cloud and on-premise infrastructure, including monitoring, alerting, and debugging infrastructure.
- You will design and implement continuous integration and deployment workflows, with best practices in testing, linting, and dependency management.
- You will maintain our data stores, monitor the load, and design and implement backup and restore plans, scaling, and clustering (sharding/replication).
- You will collaborate and coordinate with other departments (product, data science, etc.) to solve their use cases.
- You will be exploring and learning new technologies that can complement or replace our current stack to improve it.
- You will be installing servers and network equipment and configuring them using infrastructure as code techniques.
- You will practice sustainable incident response and blameless postmortems.
- BSc/BA in Computer Engineering, Computer Science or a related discipline.
- 5 years of experience in a similar position (SRE, DevOps or infrastructure engineering)
- Professional certifications are appreciated.
- Solid experience with container runtimes and orchestrators: Docker and Kubernetes.
- Experience with at least one of the major cloud providers: AWS, Azure, GCP or Oracle.
- Preferred languages for our infrastructure as code are Python and Golang.
- Experience with Linux servers, including competency in Bash scripting.
- Experience with Infrastructure as code.
- Experience with automating deployment pipelines.
- Solid foundation in networking.
- Knowledge of big data platforms like Kafka, Hadoop, and Spark is a plus.
- Knowledge of SQL and SQL database management is a plus.
- Knowledge of Terraform or Ansible is a plus.
- You will be at the forefront of an exciting time for the Middle East, joining a high-growth rocket-ship in an exciting space.
- You will be given a lot of responsibility and trust. We believe that the best results come when the people responsible for a function are given the freedom to do what they think is best.
- The fundamentals will be taken care of: competitive compensation, top-tier health insurance, and an enabling culture so that you can focus on what you do best.
- You will enjoy a fun and dynamic workplace working alongside some of the greatest minds in AI.
- We believe strength lies in difference, embracing all for who they are and empowering them to be the best version of themselves.
- Seniority level: Mid-Senior level
- Employment type: Full-time
- Job function: Engineering and Information Technology
- Industries: Software Development
Senior Site Reliability Engineer
Posted 19 days ago
Job Description
Mozn is a rapidly growing technology firm revolutionising the field of Artificial Intelligence and Data Science, headquartered in Riyadh, Saudi Arabia. It is committed to realising Vision 2030 and has a proven track record of supporting and growing the tech ecosystem in Saudi Arabia and the GCC region. Mozn is the trusted AI technology partner for some of the largest government organizations, as well as many large corporations and startups.
We are in an exciting stage of scaling the company to provide AI-powered products and solutions both locally and globally, ensuring the growth and prosperity of our digital society. It is an exciting time to work in the field of AI to create a lasting impact.
About the role
We are looking for a Senior Site Reliability Engineer to join our team. In this role, you will help ensure our systems are operational and secure, manage our enterprise applications, and maintain our networks.
What you'll do
- Combine software engineering, system architecture design, and operational tasks.
- Participate in team meetings, sprint planning, and collaborate as part of the Engineering team on projects.
- Design, build, support, and scale cloud and on-premise infrastructure, including monitoring, alerting, and debugging.
- Implement continuous integration and deployment workflows with best practices in testing, linting, and dependency management.
- Maintain data stores, monitor load, and design backup, restore, scaling, and clustering strategies.
- Collaborate with other departments such as product and data science to address their use cases.
- Explore and adopt new technologies to enhance our current stack.
- Install and configure servers and network equipment using infrastructure as code techniques.
- Practice sustainable incident response and conduct blameless postmortems.
- BSc/BA in Computer Engineering, Computer Science, or a related field.
- At least 5 years of experience in SRE, DevOps, or infrastructure engineering roles.
- Professional certifications are a plus.
- Experience with container runtimes and orchestrators like Docker and Kubernetes.
- Experience with major cloud providers such as AWS, Azure, GCP, or Oracle.
- Proficiency in infrastructure as code languages like Python and Golang.
- Experience with Linux servers and Bash scripting.
- Experience with automating deployment pipelines.
- Strong networking knowledge.
- Knowledge of big data platforms like Kafka, Hadoop, and Spark is a plus.
- Knowledge of SQL and database management is a plus.
- Experience with Terraform or Ansible is a plus.
Join us during an exciting time for the Middle East in a high-growth environment. We offer significant responsibility and trust, competitive compensation, top-tier health insurance, and a culture that empowers you to excel. Work alongside some of the greatest minds in AI in a fun and dynamic workplace that values diversity and individuality.
Senior Site Reliability Engineer
Posted today
Job Description
Mozn is a rapidly growing technology firm revolutionising the field of Artificial Intelligence and Data Science, headquartered in Riyadh, Saudi Arabia. It is committed to realising Vision 2030 and has a proven track record of supporting and growing the tech ecosystem in Saudi Arabia and the GCC region. Mozn is the trusted AI technology partner for some of the largest government organizations, as well as many large corporations and startups.
We are in an exciting stage of scaling the company to provide AI-powered products and solutions both locally and globally, ensuring the growth and prosperity of our digital society. It is an exciting time to work in the field of AI to create a lasting impact.
About the role
We are looking for a Senior Site Reliability Engineer to join our team. In this role, you will help ensure our systems are operational and secure, manage our enterprise applications, and maintain our networks.
What you'll do
- Combine software engineering, system architecture design, and operational tasks.
- Participate in team meetings, sprint planning, and collaborate as part of the Engineering team on projects.
- Design, build, support, and scale cloud and on-premise infrastructure, including monitoring, alerting, and debugging.
- Implement continuous integration and deployment workflows with best practices in testing, linting, and dependency management.
- Maintain data stores, monitor load, and design backup, restore, scaling, and clustering strategies.
- Collaborate with other departments such as product and data science to address their use cases.
- Explore and adopt new technologies to enhance our current stack.
- Install and configure servers and network equipment using infrastructure as code techniques.
- Practice sustainable incident response and conduct blameless postmortems.
- BSc/BA in Computer Engineering, Computer Science, or a related field.
- At least 5 years of experience in SRE, DevOps, or infrastructure engineering roles.
- Professional certifications are a plus.
- Experience with container runtimes and orchestrators like Docker and Kubernetes.
- Experience with major cloud providers such as AWS, Azure, GCP, or Oracle.
- Proficiency in infrastructure as code languages like Python and Golang.
- Experience with Linux servers and bash scripting.
- Experience with automating deployment pipelines.
- Strong networking knowledge.
- Knowledge of big data platforms like Kafka, Hadoop, and Spark is a plus.
- Knowledge of SQL and database management is a plus.
- Experience with Terraform or Ansible is a plus.
Join us during an exciting time for the Middle East in a high-growth environment. We offer significant responsibility and trust, competitive compensation, top-tier health insurance, and a culture that empowers you to excel. Work alongside some of the greatest minds in AI in a fun and dynamic workplace that values diversity and individuality.
Staff Site Reliability Engineer
Posted today
Job Viewed
Job Description
Who Are We
We are Foodics, a leading restaurant management ecosystem and payment tech provider. Founded in 2014, with headquarters in Riyadh and offices across 5 countries, including UAE, Egypt, Jordan and Kuwait, we currently serve customers and partners in over 35 countries worldwide. Our innovative products have successfully processed over 6 billion (yes, billion with a B) orders so far, making Foodics one of the most rapidly evolving SaaS companies to emerge from the MENA region. Foodics has also completed three rounds of funding, with the latest raising $170 million in the largest SaaS funding round in MENA, boosting its innovation capabilities to better serve business owners.
The Job in a Nutshell
We are seeking a Staff Site Reliability Engineer (SRE) to join our high-impact engineering team. In this role, you will be responsible for ensuring the scalability, performance, and reliability of Foodics’ cloud-native platforms and services. You will design, implement, and evolve infrastructure solutions and operational processes that support millions of transactions daily, while championing best practices in observability, incident management, and resilience engineering. Your expertise will help us maintain world-class uptime and seamless customer experiences as we continue to grow at scale.
What Will You Do
- Design and maintain scalable, highly available, and fault-tolerant systems across multiple cloud providers (AWS, OCI).
- Lead incident response efforts, conducting blameless post-mortems and driving systemic improvements.
- Build and refine automated deployment pipelines, ensuring fast, safe, and repeatable delivery of changes.
- Implement robust observability frameworks (metrics, tracing, logging) to proactively detect and address performance issues.
- Collaborate with development teams to embed reliability into every stage of the software lifecycle.
- Optimize infrastructure costs while maintaining service quality.
- Drive chaos engineering experiments to validate system resilience.
- Document architecture, runbooks, and operational processes for internal and cross-team use.
What Are We Looking For
We’re looking for a reliability-focused engineer with strong technical depth, who thrives in solving complex operational challenges at scale. You must be hands-on with distributed systems, cloud-native platforms, and automation tools.
- Strong background in SRE principles (SLIs, SLOs, SLAs) and operational excellence.
- Experience with Kubernetes, container orchestration, and service mesh technologies.
- Proven expertise in infrastructure as code (Terraform, Ansible; Crossplane is optional) and automation scripting (Bash, Python, Go).
- Deep understanding of monitoring and alerting systems (Prometheus/Grafana, ELK, Loki, Datadog, AWS CloudWatch).
- Skilled in cloud networking, load balancing, API gateway management (NGINX, Kong, AWS API GW).
- Solid experience with relational and NoSQL databases in production (MySQL/PostgreSQL, MongoDB, DocumentDB, Redis).
- Familiarity with distributed tracing (Jaeger, OpenTelemetry) and chaos testing frameworks.
- Excellent troubleshooting skills and ability to resolve high-impact incidents under pressure.
Who Will Excel
- Candidates who have successfully operated high-traffic, mission-critical platforms in a cloud-native environment.
- Candidates who demonstrate strong collaboration and communication skills across engineering, product, and business teams.
- Candidates who bring a data-driven approach to performance tuning and capacity planning.
- Candidates who thrive in fast-paced, high-growth SaaS environments and embrace continuous improvement.
What We Offer You
We believe you will love working at Foodics!
- Highly competitive compensation packages, including bonuses and potential equity.
- Annual learning stipend and regular training to accelerate your career.
- Exposure to cutting-edge cloud technologies and large-scale distributed systems.
- A truly global team of over 30 nationalities in 14 countries.
- Autonomy, challenging goals, and the chance to directly impact the reliability of platforms serving millions.