Job Overview:
We seek a Service Reliability Analyst – Network Infrastructure to improve our network systems' stability, performance, and observability. This role merges traditional NOC duties with modern AI Ops practices in a 24/7 shift-based team. The ideal candidate will detect and resolve network anomalies, use AI/ML insights to optimize operations, and support continuous service availability and performance improvement.
Responsibilities:
- Lead advanced troubleshooting and resolution of network incidents spanning LAN, WAN, VPN, SD-WAN, data centers, and cloud networks (AWS, Azure, GCP).
- Drive adoption and integration of AI Ops tools (e.g., Dynatrace, LogicMonitor) to enable proactive anomaly detection, alert correlation, and incident automation.
- Work with engineering and platform teams to expand observability coverage, tune alerting thresholds, and onboard new network services to SRC monitoring.
- Perform deep-dive root cause analyses (RCAs), lead incident reviews, and implement preventive actions to improve service resilience.
- Design and build dashboards, reliability reports, and KPIs (MTTR, latency, packet loss, availability) to improve visibility and decision-making.
- Contribute to network automation initiatives using tools like Ansible and Terraform; develop and maintain intelligent playbooks for remediation workflows.
- Tune and optimize AI/ML models used in telemetry analysis and predictive incident detection.
- Work shifts within a 24/7/365 operating model, and work independently and flexibly in response to emergencies or critical issues.
Qualifications:
- Certifications such as Cisco CCNA/CCNP, CompTIA Network+, or equivalent.
- Cisco DevNet certification would be highly advantageous.
- Hands-on experience with network technologies and protocols (TCP/IP, BGP, OSPF, DNS, DHCP, SD-WAN).
- Experience with public cloud networking (AWS, Azure, GCP).
- Familiarity with ITIL and SRE principles (SLI/SLOs, error budgets, incident command).
- Experience integrating AI Ops tools with ITSM systems (e.g., ServiceNow, Jira Service Management).
- Exposure to automation/orchestration tools (Ansible and Terraform).
Required Skills and Experience:
- 3–6 years of hands-on experience in Platform Operations or Infrastructure Support roles.
- Solid experience with observability tools (e.g., Dynatrace, LogicMonitor, Datadog, Splunk) for real-time monitoring, alerting, and diagnostics.
- Proficiency in a scripting or programming language (e.g., Python, Java, .NET, Node.js, Ansible, or JavaScript).
- Practical knowledge of infrastructure automation using Ansible, including writing playbooks.
- Proficient in ticket management via an ITSM platform such as ServiceNow.
- Experience leading incident response, driving service restoration and coordinating root cause analysis.
- Effective communicator within a team with a proactive approach and personal accountability for outcomes.
- Ability to analyze incident patterns and metrics to proactively recommend reliability improvements.
“Nice To Have” Skills and Experience:
- Exposure to high performance computing or cloud-native services.
- Experience with telemetry and observability tools.
- Experience creating or updating Ansible playbooks for repetitive tasks or configuration.
- Curiosity about automation and DevOps practices.
Accommodations at Arm
At Arm, we want to build extraordinary teams. If you need an adjustment or an accommodation during the recruitment process, please email accommodations@arm.com. To note, by sending us the requested information, you consent to its use by Arm to arrange for appropriate accommodations. All accommodation or adjustment requests will be treated with confidentiality, and information concerning these requests will only be disclosed as necessary to provide the accommodation. Although this is not an exhaustive list, examples of support include breaks between interviews, having documents read aloud, or office accessibility. Please email us about anything we can do to accommodate you during the recruitment process.
Hybrid Working at Arm
Arm’s approach to hybrid working is designed to create a working environment that supports both high performance and personal wellbeing. We believe in bringing people together face to face to enable us to work at pace, whilst recognizing the value of flexibility. Within that framework, we empower groups/teams to determine their own hybrid working patterns, depending on the work and the team’s needs. Details of what this means for each role will be shared upon application. In some cases, the flexibility we can offer is limited by local legal, regulatory, tax, or other considerations, and where this is the case, we will collaborate with you to find the best solution. Please talk to us to find out more about what this could look like for you.
Equal Opportunities at Arm
Arm is an equal opportunity employer, committed to providing an environment of mutual respect where equal opportunities are available to all applicants and colleagues. We are a diverse organization of dedicated and innovative individuals, and don’t discriminate on the basis of race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or status as a protected veteran.