Cloud NOC Engineer - MX

Indeed

Full-time

Onsite

No experience limit

No degree limit

Isabel La Católica 5, Centro Histórico de la Cdad. de México, Centro, Cuauhtémoc, 06000 Ciudad de México, CDMX, Mexico

Description

Job Summary: We are seeking a Cloud NOC Engineer to proactively monitor data center health, manage incidents, and ensure continuity of critical operations.

Key Highlights:

1. 24/7 proactive monitoring of critical infrastructure
2. End-to-end incident management and structured technical escalation
3. Use of advanced monitoring and observability tools

### **Overview**

Whitestack deploys private clouds across multiple capitals in Latin America. At each site, dozens or even hundreds of servers operate, interconnected via high-speed networks and designed to support mission-critical applications (including mobile operator voice traffic) requiring availability levels approaching 99.999%.

For this reason, we are seeking top-tier engineers for our *Cloud Support* team: roles of high strategic importance to ensure uninterrupted operation of large-scale data centers supporting critical, always-on telecommunications applications and infrastructure that we deploy.

The **Cloud NOC Engineer** is the guardian of this infrastructure. Their mission is 24/7 proactive monitoring of data center health, detecting anomalies before they impact service. They serve as the first line of response, responsible for end-to-end incident management: from alert detection and ticket creation through resolution of low-to-medium complexity failures and structured technical escalation to L1/L2.

**This role is available for remote work from the following locations: Mexico, Chile, Argentina, Colombia, Uruguay, and Peru.**

**Available shifts: Mexico, Colombia, Peru starting at 1 PM. / Argentina, Chile, Uruguay starting at 8 AM.**

### **Responsibilities**

* Proactive Monitoring: Continuous oversight of dashboards and alerts (physical infrastructure, virtual infrastructure, and services) to guarantee 99.999% availability.
* Incident Management (Triage): Receiving, categorizing, and prioritizing alerts; rigorous ticket creation and tracking under ITIL methodologies.
* Initial Technical Resolution: Diagnosis and resolution of low- and medium-complexity failures (e.g., service restarts, log cleanup, quota adjustments, basic connectivity verification).
* Structured Escalation: When complexity exceeds initial capability, escalate to L1/L2 with a complete technical report (logs, network traces, reproduction steps, and customer context).
* Case Documentation: Maintain up-to-date event logs and knowledge base (KB) entries for recurring incidents.
* External Communication: Notify customers clearly and promptly regarding system health status, maintenance windows, and ongoing incidents.
* Health Checks: Execute periodic validation routines to verify production platform health.
* Ensure compliance with SLAs for incidents and network/service availability.
* Generate and analyze platform availability reports.

### **Requirements**

* Experience:
  + Minimum 1–2 years in Network Operations Centers (NOC), Tier-1 technical support, or systems administration.
  + Experience managing tickets and support processes (Jira, ServiceNow, or similar), including clear documentation of diagnostics, evidence, and communication.
  + Experience with Monitoring/Observability tools such as Prometheus, Grafana, Elasticsearch, OpenSearch, OpenNMS; ability to read and interpret metrics, events, logs, and alerts.
  + Experience supporting mission-critical production systems, including incident management, coordination of production actions, escalation, and effective communication.
* Education:
  + Bachelor’s degree in Computer Engineering, Systems Engineering, Electronics Engineering, or related field.
* Specific Knowledge / Technical Requirements:
  + Linux in production environments: OS and service troubleshooting (systemd, journalctl), permissions/users, processes, filesystems, and networking.
  + Linux Networking: configuration and troubleshooting of interfaces, VLANs, routes, bonding, and MTU; network troubleshooting using tools such as tcpdump (sniffing), ip, ss, ethtool, ping/traceroute.
  + Kubernetes: production operation/administration and troubleshooting (Pods, Deployments/DaemonSets, Services, events/logs, readiness/liveness; familiarity with storage PV/PVC).
  + Virtualization: experience operating and supporting virtualized environments (KVM/VMware/Hyper-V or others), including diagnosis of common compute, network, and storage failures.
  + Automation: ability to automate repetitive tasks using Bash and Ansible and/or Python (information gathering, operational checks, basic remediation, safe production scripts).
  + Intermediate English proficiency for reading/writing technical documentation, updating stakeholders, and interacting with vendors/manufacturers during support cases.
* Professional Requirements:
  + Autonomy (to achieve optimal results)
  + Adherence to world-class standards
  + Goal orientation
  + Openness to learning new technologies
  + Analytical thinking
  + Teamwork (to coordinate with development and product deployment teams)
  + Rapid adaptation to a highly dynamic environment
* Desired Technical Requirements:
  + Experience with OpenStack (operation, troubleshooting, or administration) and/or KVM
  + Understanding of Fixed or Mobile Network operations models
  + Experience integrating and operating open-source projects in production environments
  + Intermediate Networking: BGP, EVPN-VXLAN, etc.
  + Certifications: Linux, OpenStack, Kubernetes Administrator (CKA or equivalent)
  + Training in Ansible and/or Bash scripting
  + Knowledge of ITIL (Incident, Request, Problem, Change Management) and/or Scrum

#### **About Us**

**Whitestack** is a leading Latin American company specializing in cloud solutions and hyper-scalable digital infrastructure. We leverage open-source technologies and industry-leading standards to drive digital transformation across the region.
We are a **Great Place to Work**, where innovation, collaboration, and personal development are at the core of our culture.

**Why join Whitestack?**

* International exposure: Participate in global initiatives and travel to collaborate with teams across countries.
* Real work-life balance: Policies designed to fit your lifestyle, enabling autonomous, purpose-driven work.
* Clear growth path: A robust career track in both leadership and technology.
* Health first: Private medical insurance for you and your family.
* Unlimited learning: Access to courses, books, materials, and certification reimbursement.
* Languages for the world: Language courses to ensure your growth knows no borders.
* Technology in your hands: Equipment refreshed every 3 years, and yours to keep at the end of the cycle!
* Recognition for effort: Performance and project success bonuses.
* Time for you: Minimum 15 vacation days, a birthday day off, and additional breaks before Independence Day, Christmas, and New Year.
* Connection & fun: Budget for recreational and team-building activities.
* Innovation culture: Your ideas matter. We encourage strategic participation from every role.

Learn more about our benefits here.

Source: Indeed