




Job Summary: We are seeking a Cloud NOC Engineer to proactively monitor data center health, manage incidents, and ensure continuity of critical operations.

Key Highlights:

1. 24/7 proactive monitoring of critical infrastructure
2. End-to-end incident management and structured technical escalation
3. Use of advanced monitoring and observability tools

### **Overview**

Whitestack deploys private clouds across multiple capitals in Latin America. At each site, dozens or even hundreds of servers operate, interconnected via high-speed networks and designed to support mission-critical applications (including mobile operator voice traffic) that require availability levels approaching 99.999%.

For this reason, we are seeking top-tier engineers for our *Cloud Support* team. These roles are of high strategic importance: they ensure the uninterrupted operation of the large-scale data centers we deploy, which support critical, always-on telecommunications applications and infrastructure.

The **Cloud NOC Engineer** is the guardian of this infrastructure. Their mission is 24/7 proactive monitoring of data center health, detecting anomalies before they impact service. They serve as the first line of response, responsible for end-to-end incident management: from alert detection and ticket creation, through resolution of low-to-medium complexity failures, to structured technical escalation to L1/L2.

**This role is available for remote work from the following locations: Mexico, Chile, Argentina, Colombia, Uruguay, and Peru.**

**Available shifts: Mexico, Colombia, and Peru starting at 1 PM; Argentina, Chile, and Uruguay starting at 8 AM.**

### **Responsibilities**

* Proactive Monitoring: Continuous oversight of dashboards and alerts (physical infrastructure, virtual infrastructure, and services) to guarantee 99.999% availability.
* Incident Management (Triage): Receiving, categorizing, and prioritizing alerts; rigorous ticket creation and tracking under ITIL methodologies.
* Initial Technical Resolution: Diagnosis and resolution of low- and medium-complexity failures (e.g., service restarts, log cleanup, quota adjustments, basic connectivity verification).
* Structured Escalation: When complexity exceeds initial capability, escalate to L1/L2 with a complete technical report (logs, network traces, reproduction steps, and customer context).
* Case Documentation: Maintain up-to-date event logs and knowledge base (KB) entries for recurring incidents.
* External Communication: Notify customers clearly and promptly regarding system health status, maintenance windows, and ongoing incidents.
* Health Checks: Execute periodic validation routines to verify production platform health.
* SLA Compliance: Ensure compliance with SLAs for incidents and network/service availability.
* Reporting: Generate and analyze platform availability reports.

### **Requirements**

* Experience:
  + Minimum 1–2 years in Network Operations Centers (NOC), Tier-1 technical support, or systems administration.
  + Experience managing tickets and support processes (Jira, ServiceNow, or similar), including clear documentation of diagnostics, evidence, and communication.
  + Experience with monitoring/observability tools such as Prometheus, Grafana, Elasticsearch, OpenSearch, and OpenNMS; ability to read and interpret metrics, events, logs, and alerts.
  + Experience supporting mission-critical production systems, including incident management, coordination of production actions, escalation, and effective communication.
* Education:
  + Bachelor’s degree in Computer Engineering, Systems Engineering, Electronics Engineering, or a related field.
* Specific Knowledge / Technical Requirements:
  + Linux in production environments: OS and service troubleshooting (systemd, journalctl), permissions/users, processes, filesystems, and networking.
  + Linux Networking: configuration and troubleshooting of interfaces, VLANs, routes, bonding, and MTU; network troubleshooting using tools such as tcpdump (packet capture), ip, ss, ethtool, and ping/traceroute.
  + Kubernetes: production operation/administration and troubleshooting (Pods, Deployments/DaemonSets, Services, events/logs, readiness/liveness probes); familiarity with storage (PV/PVC).
  + Virtualization: experience operating and supporting virtualized environments (KVM, VMware, Hyper-V, or others), including diagnosis of common compute, network, and storage failures.
  + Automation: ability to automate repetitive tasks using Bash and Ansible and/or Python (information gathering, operational checks, basic remediation, safe production scripts).
  + Intermediate English proficiency for reading/writing technical documentation, updating stakeholders, and interacting with vendors/manufacturers during support cases.
* Professional Requirements:
  + Autonomy (to achieve optimal results)
  + Adherence to world-class standards
  + Goal orientation
  + Openness to learning new technologies
  + Analytical thinking
  + Teamwork (to coordinate with development and product deployment teams)
  + Rapid adaptation to a highly dynamic environment
* Desired Technical Requirements:
  + Experience with OpenStack (operation, troubleshooting, or administration) and/or KVM
  + Understanding of fixed or mobile network operations models
  + Experience integrating and operating open-source projects in production environments
  + Intermediate networking: BGP, EVPN-VXLAN, etc.
  + Certifications: Linux, OpenStack, Kubernetes Administrator (CKA or equivalent)
  + Training in Ansible and/or Bash scripting
  + Knowledge of ITIL (Incident, Request, Problem, Change Management) and/or Scrum

#### **About Us**

**Whitestack** is a leading Latin American company specializing in cloud solutions and hyper-scalable digital infrastructure. We leverage open-source technologies and industry-leading standards to drive digital transformation across the region.
We are a **Great Place to Work**, where innovation, collaboration, and personal development are at the core of our culture.

**Why join Whitestack?**

* International exposure: Participate in global initiatives and travel to collaborate with teams across countries.
* Real work-life balance: Policies designed to fit your lifestyle, enabling autonomous, purpose-driven work.
* Clear growth path: A robust career track in both leadership and technology.
* Health first: Private medical insurance for you and your family.
* Unlimited learning: Access to courses, books, materials, and certification reimbursement.
* Languages for the world: Language courses to ensure your growth knows no borders.
* Technology in your hands: Equipment refreshed every 3 years, and yours to keep at the end of the cycle!
* Recognition for effort: Performance and project success bonuses.
* Time for you: Minimum 15 vacation days, a birthday day off, and additional breaks before Independence Day, Christmas, and New Year.
* Connection & fun: Budget for recreational and team-building activities.
* Innovation culture: Your ideas matter. We encourage strategic participation from every role.


