Interswitch Limited is an integrated payment and transaction processing company that provides technology integration, advisory services, transaction processing ...
AjoCard is a financial technology company focused on making it easy for the underserved and unbanked to manage their money and make payments securely and ...
Deimos is a Cloud-native Developer and Security Operations technology services company. We help companies of all sizes adopt the Cloud for improved service ...
Interswitch Limited is an integrated payment and transaction processing company that provides technology integration, advisory services, transaction processing ...
Ascentech Services Ltd acts as a gateway to provide a wide range of recruitment and selection services to companies. We are a dedicated team of professional ...
At Renmoney, we believe finance should be simple, useful and accessible to everyone. That’s what makes us really passionate about leveraging data driven ...
At Renmoney, we believe finance should be simple, useful and accessible to everyone. That’s what makes us really passionate about leveraging data driven ...
Termii is a communications platform that allows African businesses to send messages to anyone across SMS, email, voice, and instant messaging channels. With ...
Tezza”(te-zza) from the Italian word "Completezza” embodies our commitment to providing IT and Business Solutions that are comprehensive, through ...
ENGIE is a leading global energy company that builds its businesses around a model based on responsible growth to take on energy transition challenges. We ...
A site reliability engineer is a unique role that requires either a background as a sysadmin, a software developer with additional operations experience, or someone in an IT operations role that also has software development skills. SRE teams are responsible for how code is deployed, configured, and monitored, as well as the availability, latency, change management, emergency response, and capacity management of services in production. SRE teams determine the launch of new features by using service-level agreements (SLAs) to define the required reliability of the system through service-level indicators (SLI) and service-level objectives (SLO).
Building software to help DevOps, ITOps & support teams
SRE teams are in charge of proactively building and implementing services to make IT and support better at their jobs. This can be anything from adjustments to monitoring and alerting to code changes in production. A site reliability engineer can be tasked with building a homegrown tool from scratch to help with weaknesses in software delivery or incident management.
Fixing support escalation issues
Similar to the point above, a site reliability engineer can expect to spend time fixing support escalation cases. But, as your SRE operations mature, your systems will become more reliable and you’ll see fewer critical incidents in production – leading to fewer support escalations.
Because an SRE team touches so many different parts of the engineering and IT organization, they can be a great source of knowledge and can be helpful for routing issues to the right people and teams.
Optimizing on-call rotations & processes
More times than not, site reliability engineers will need to take on-call responsibilities. At most organizations, the SRE role will have a lot of say in how the team can improve system reliability through the optimization of on-call processes.
SRE teams will help add automation and context to alerts – leading to better real-time collaborative response from on-call responders. Additionally, site reliability engineers can update runbooks, tools and documentation to help prepare on-call teams for future incidents.
Documenting “tribal” knowledge
SRE teams gain exposure to systems in both staging and production, as well as all technical teams. They take part in work with software development, support, IT operations and on-call duties – meaning they build up a great amount of historical knowledge over time. Instead of siloing this knowledge into the mind of one team or one person, site reliability engineers can be tasked with documenting much of what they know. Constant upkeep of documentation and runbooks can ensure that teams get the information they need right when they need it.
Conducting post-incident reviews
Without thorough post-incident reviews, you have no way to identify what’s working and what’s not. SRE teams need to keep teams honest and ensure that everyone — software developers and IT professionals — are conducting post-incident reviews, documenting their findings and taking action on their learnings.
Then, site reliability engineers are often tasked with action items for building or optimizing some part of the SDLC or incident lifecycle to bolster the reliability of their service.
Coding languages: As an SRE, you will need to be proficient in at least one coding language. This is because you will often be required to write code in order to automate tasks or build tools. The most popular coding languages among SREs are Python, Java, and Go.
CI/CD pipeline development: In order to release code changes safely and efficiently, you will need to be well-versed in continuous integration (CI) and continuous delivery (CD) pipelines.
Mastered distributed computing: Many companies today use distributed systems in order to achieve high availability and scalability. As an SRE, you will need to have a deep understanding of how distributed systems work in order to be able to troubleshoot and optimize them.
Using Monitoring tools: Monitoring is essential for keeping track of the health of company services and products. As an SRE, you should be familiar with various monitoring tools such as Prometheus, Solarwinds, Pingdom, Zabbix, and Zoho.
Using version control tools: Version control tools such as Git are used by developers to share and manage code changes. As an SRE, you will need to be familiar with these tools in order to help developers with code deployments.
Understanding operating systems: To effectively manage company services, you will need to have a deep understanding of various operating systems such as Linux, Windows, and macOS.
Deep understanding of databases: Databases are often used by company services in order to store data. As an SRE, you should have a deep understanding of how different types of databases work in order to be able to effectively troubleshoot any issues that may arise.
Automation skills: Automation is crucial for reducing the amount of manual work that needs to be done in order to maintain company services. As an SRE, you should be proficient in various automation tools such as ACCELQ and Avo Assure.
Knowing cloud-native applications: Cloud-native applications are designed specifically for deployment on cloud platforms such as AWS and Azure. As an SRE, you should have experience working with cloud-native applications to manage them effectively.