Services

AI Data Center Networking

The network is the backbone of every AI system. I help organizations design, optimize, and operate the high-performance fabrics that keep GPU clusters running at peak efficiency — from single-site deployments to multi-DC AI factories.

AI training and inference workloads make extreme demands on network infrastructure. Latency spikes, congestion, and misconfigured topologies directly reduce GPU utilization — meaning your most expensive hardware sits idle waiting for data.

I bring hands-on expertise across the full AI networking stack: physical topology design, overlay/underlay protocol selection, subnet management, congestion control tuning, Kubernetes pod networking, and multi-site scale-out. Whether you're building a new cluster or troubleshooting an underperforming one, I can help.

Expertise Areas

Network Architecture & Design

  • Fat-tree, dragonfly & rail-optimized topologies
  • InfiniBand HDR/NDR fabric design
  • RoCEv2 / RDMA over Ethernet
  • Multi-rail GPU-to-GPU interconnects

Overlay & Underlay Protocols

  • BGP-based EVPN overlay design
  • VXLAN overlays & Segment Routing (SR) underlays
  • IP fabric design for AI workloads
  • Multi-path routing and ECMP

Subnet Management & Operations

  • OpenSM / NVIDIA UFM configuration
  • Subnet partitioning and QoS policies
  • Adaptive routing and load balancing
  • Firmware and driver management

Scale & High Availability

  • Scale-out architectures (intra-DC)
  • Inter-DC connectivity and federation
  • High availability design and failover
  • Capacity planning and growth modeling

Performance Optimization

  • Congestion control (ECN, PFC, DCQCN)
  • GPU utilization and MFU analysis
  • Network bottleneck identification
  • All-reduce communication optimization

Orchestration & Tooling

  • Kubernetes networking & pod management
  • Slurm / job scheduler integration
  • Network telemetry and observability
  • Automated fault detection & alerting

Platforms & Technologies

NVIDIA InfiniBand, NVIDIA UFM, Mellanox/NVIDIA NICs, OpenSM, RoCEv2, RDMA, BGP / EVPN, VXLAN, Segment Routing, DGX H100/A100, HGX Systems, GPUDirect RDMA, Spectrum Ethernet, Arista, Cisco Nexus, NCCL, MPI, Kubernetes, Slurm, Prometheus / Grafana

Engagement Types

Architecture Review

A structured review of your current or planned network design with a written assessment and prioritized recommendations.

Infrastructure Design

End-to-end design of AI factory infrastructure, from fabric topology and protocol selection to physical layout, redundancy, and day-two operations, built to support your AI workloads at scale.

Optimized Vendor Product Selection

A vendor evaluation that cuts through the noise to identify the right products for each layer of your network, based on your business and technical requirements, resiliency needs, and scaling trajectory, so you invest in solutions that fit, not just those being marketed.

Ongoing Advisory

A retainer relationship for organizations that want a trusted expert available as their AI infrastructure evolves — for design reviews, vendor evaluations, and escalation support.

Is your AI network holding back your GPUs?

Let's find out. A brief conversation is often enough to identify quick wins.

Schedule a Call