Site Reliability Engineer
Trinidad & Tobago
At Simply Intense, we’re passionate about building software that solves big problems. We count on our site reliability engineers (SREs) to empower our users with a rich feature set, high availability, and stellar performance as they pursue their missions. As we expand our customer deployments, we are seeking an experienced SRE to deliver insights from massive-scale data in real time. Specifically, we are searching for someone who brings fresh ideas, demonstrates a unique and informed viewpoint, and enjoys collaborating with a cross-functional, global team to develop real-world solutions and positive user experiences at every interaction.
Objectives of this Role:
- Run the production environment by monitoring availability and taking a holistic view of system health.
- Build and configure tools to manage platform infrastructure and applications.
- Improve the reliability and quality of our software platform.
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve.
- Provide primary operational support for a large distributed software application.
Daily and Monthly Responsibilities:
- Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding.
- Partner with development teams to improve services through rigorous testing and release procedures.
- Participate in system design consulting, platform management, and capacity planning.
- Create sustainable systems and services through automation and uplifts.
- Balance feature development speed and reliability with well-defined service level objectives.
Required Skills and Qualifications:
- Previous success in technical engineering.
- Bachelor’s degree in computer science or another highly technical or scientific discipline.
- Experience managing Kubernetes clusters.
- Experience with AWS services (EKS, S3, EC2, ALB, CloudFront, RDS, ElastiCache, etc.).
- Experience configuring Datadog for monitoring and observability.
- A proactive approach to spotting problems, areas for improvement, and performance bottlenecks.