SRE

Location: Hyderabad / Secunderabad, Telangana

Job Overview

Experience: 7.0 - 12.0 Years

Salary: INR 7 - 12 L.P.A.

Other Course: Graduate - Qualification Mentioned in Job Description

Posted On: Jun 11, 2025

Valid Upto: Jul 11, 2025

Job Type: Permanent Job

Gender: Both Male & Female

PWD: 0

Job Description

Job Summary:

Job Title: Site Reliability Engineer (SRE) - Senior Engineer

Experience Level

Minimum 7 years of work experience as an SRE (not traditional production support), of which 6-8 years should be in critical production environments.

Key Responsibilities

As a Site Reliability Engineer (SRE), you will:

  • Platform Support & Reliability: Work as part of a 24/7 on-desk team in shifts, managing middleware and associated applications consumed globally, covering incident, change, event, and problem management. Be the guardian to ensure high reliability of applications, middleware, storage platforms, schedulers (and their jobs), and underlying cloud infrastructure.
  • Cloud Infrastructure Management: Assess and enhance the resilience of cloud infrastructure and data pipelines. Provide operational support and administration for GCP (Google Cloud) and private cloud, including provisioning, capacity management, reliability management, monitoring, and restoration. Handle Kubernetes cluster management, monitoring, and remediation.
  • Observability & Monitoring: Set up and configure an observability product (preferably AppDynamics or Splunk) for end-to-end traceability and log analytics. Define SLIs and configure SLOs, respond to threshold alerts, and continuously optimize monitoring capability to reduce noise. Set up anomaly detection and auto-remediation workflows.
  • Automation & CI/CD: Write code and automation scripts (particularly for the integration tier and middleware) to automate deployments and build self-healing workflows based on telemetry. Work with CI/CD toolchains, setting up and running deployment pipelines and propagating changes across environments. Automate new change rollouts and validate/automate patch rollout processes.
  • Troubleshooting & Debugging: Debug integrations and consumers at the code level. Work with code as well as configuration artifacts to debug and fix issues that may arise. Diagnose and debug systems at the application level.
  • Toil Reduction: Eliminate toil by lowering incident volume, eliminating noise from alerts, automating manual processes, and converting workarounds into system features.
  • Collaboration & Engagement: Work with Development, QA, and other squads to design, build, and roll out reliability features into applications. Engage in on-call and critical operations support activities while leading blameless post-mortems. Liaise directly with customers for stakeholder management.

Mandatory Skills & Experience

Technical Proficiency:

  • SRE & DevOps Experience: Minimum 7 years of work experience as an SRE (not Traditional Production Support) covering integration platforms on cloud-based deployments.
  • Cloud Platforms: Working experience with GCP (Google Cloud), particularly with GKE (Google Kubernetes Engine). Experience with AWS cloud infrastructure operations in production is also required.
  • Coding/Scripting/Automation: Proficiency in coding and automation scripting in any programming language (e.g., Python scripts, Ansible templates), particularly for the integration tier and middleware. Ability to work with code and configuration artifacts to debug and fix issues.
  • Containerization & Orchestration: Knowledge of Docker is important, along with experience in Kubernetes cluster management, monitoring, and remediation.
  • Middleware: Experience maintaining middleware such as Kafka (open source) and MQ, as well as application servers (Tomcat).
  • Data Storage: Experience maintaining Hazelcast data storage platform clusters.
  • Job Scheduling: Experience with Control-M job schedulers.
  • Monitoring Tools: Experience using AppDynamics and Splunk for monitoring and for setting up observability. Experience implementing system and application monitoring for cloud-based applications/SaaS components, including setting up alerts and building dashboards.
  • CI/CD: Experience with CI/CD toolchains, setting up and running deployment pipelines, propagating changes across environments, and troubleshooting failed deployments.
  • SQL: Working knowledge of SQL, including writing queries for troubleshooting.
  • SRE Principles: Knowledge of applying SRE practices to daily operations is key, particularly toil reduction, blameless post-mortems, monitoring distributed systems, and release engineering.

Experience & Qualifications:

  • Minimum 7 years of work experience as an SRE in critical production environments.
  • Experience working as a DevOps Engineer or SRE in mission-critical applications and infrastructure.
  • Ability to work in shifts in office (24/7 on-desk operation is mandatory).

Certifications (Mandatory/Preferred):

  • SRE Foundation certification by DevOps Institute or any other equivalent certification on SRE by a recognized body is mandatory (SRE Foundation certification via PeopleCert / DevOps Institute is beneficial).
  • CKA certification (Certified Kubernetes Administrator).
  • GCP Cloud Digital Leader certification at a minimum is mandatory; Cloud Engineer level is a bonus.
  • Hazelcast Platform Operations certification badge (Preferred).
  • ITIL4 Foundation certification is preferred.
  • AWS Solutions Architect - Associate certification or an equivalent is preferred.

Essential Professional Skills

  • Problem-Solving & Debugging: Strong ability to diagnose and debug systems at the application level. Willingness to work on proof-of-concept solutions to optimize reliability (e.g., AI models for event correlation and assisted triaging).
  • Observability Definition: Ability to define SLIs and configure SLOs, and continuously refine thresholds.
  • Communication & Collaboration: Excellent communication skills. Ability to work collaboratively in a team and with Development, QA, and other squads. Ability to liaise directly with customers for stakeholder management.
  • Ownership: Act as the guardian of high reliability for the platforms in scope.
  • Learning Agility: Ability to continuously refine alerts and ensure all alerts/incidents within scope are actioned before they breach SLOs.

Company Info

Company: Nana Biscuits

Type: Food Processing

Contact Person: Vikrant Soni

Email: vxxxxxxxxxxx@live.com

Phone: 75xxxxx01

Website: https://www.nanabiscuits.com

Address: C-3/2, Radio Colony, Kingsway Camp