Services

AI Data Center Networking

The network is the backbone of every AI system. I help organizations design, optimize, and operate the high-performance fabrics that keep GPU clusters running at peak efficiency — from single-site deployments to multi-DC AI factories.

AI training and inference workloads make extreme demands on network infrastructure. Latency spikes, congestion, and misconfigured topologies directly reduce GPU utilization — meaning your most expensive hardware sits idle waiting for data.
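To make the cost of a slow fabric concrete, here is a rough back-of-the-envelope sketch. It assumes an idealized ring all-reduce (2 × (N − 1) steps, each moving 1/N of the message) and a simple model in which any communication not overlapped with compute leaves the GPU idle. The function names, the 1 GB gradient size, the 100 ms compute time, and the link speeds are all hypothetical illustration values, not measurements from any real cluster.

```python
def ring_allreduce_seconds(message_bytes, num_gpus, link_gbps, per_step_latency_s=0.0):
    """Idealized ring all-reduce time: 2*(N-1) steps, each moving message/N bytes."""
    if num_gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    steps = 2 * (num_gpus - 1)
    bytes_per_step = message_bytes / num_gpus
    link_bytes_per_s = link_gbps * 1e9 / 8  # Gb/s -> bytes/s
    return steps * (bytes_per_step / link_bytes_per_s + per_step_latency_s)


def gpu_busy_fraction(compute_s, comm_s, overlap=0.0):
    """Share of a training step spent computing.

    `overlap` is the fraction of communication hidden behind compute (0.0 to 1.0);
    the remainder is treated as time the GPU spends waiting.
    """
    exposed_comm_s = comm_s * (1.0 - overlap)
    return compute_s / (compute_s + exposed_comm_s)


if __name__ == "__main__":
    # Hypothetical scenario: 1 GB of gradients, 8 GPUs, 100 ms of compute per step.
    for gbps in (400, 200):
        comm = ring_allreduce_seconds(1e9, 8, gbps)
        busy = gpu_busy_fraction(0.100, comm)
        print(f"{gbps} Gb/s: all-reduce {comm * 1e3:.1f} ms, GPU busy {busy:.0%}")
```

Under these assumptions, halving the effective link bandwidth (for example through congestion or a misconfigured path) doubles the all-reduce time and pushes a noticeably larger share of each step into waiting. Real collectives overlap communication with compute and use more sophisticated algorithms, so treat this only as a first-order estimate.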

I bring hands-on expertise across the full stack of AI networking: from physical topology design and overlay/underlay protocol selection to subnet management, congestion control tuning, Kubernetes pod networking, and multi-site scale-out. Whether you're building a new cluster or troubleshooting an underperforming one, I can help.

Expertise Areas

Network Architecture & Design

Fat-tree, dragonfly & rail-optimized topologies
InfiniBand HDR/NDR fabric design
RoCEv2 / RDMA over Ethernet
Multi-rail GPU-to-GPU interconnects

Overlay & Underlay Protocols

BGP-based EVPN overlay design
VXLAN overlays & Segment Routing (SR) underlays
IP fabric design for AI workloads
Multi-path routing and ECMP

Subnet Management & Operations

OpenSM / NVIDIA UFM configuration
Subnet partitioning and QoS policies
Adaptive routing and load balancing
Firmware and driver management

Scale & High Availability

Scale-out architectures (intra-DC)
Inter-DC connectivity and federation
High availability design and failover
Capacity planning and growth modeling

Performance Optimization

Congestion control (ECN, PFC, DCQCN)
GPU utilization and MFU analysis
Network bottleneck identification
All-reduce communication optimization

Orchestration & Tooling

Kubernetes networking & pod management
Slurm / job scheduler integration
Network telemetry and observability
Automated fault detection & alerting

Platforms & Technologies

NVIDIA InfiniBand
NVIDIA UFM
Mellanox/NVIDIA NICs
OpenSM
RoCEv2
RDMA
BGP / EVPN
VXLAN
Segment Routing
DGX H100/A100
HGX Systems
GPUDirect RDMA
Spectrum Ethernet
Arista
Cisco Nexus
NCCL
MPI
Kubernetes
Slurm
Prometheus / Grafana

Engagement Types

Architecture Review

A structured review of your current or planned network design with a written assessment and prioritized recommendations.

Infrastructure Design

End-to-end design of AI factory infrastructure, from fabric topology and protocol selection to physical layout, redundancy, and day-two operations, built to support your AI workloads at scale.

Optimized Vendor Product Selection

I cut through the marketing noise to identify the right vendors for each layer of your network, based on your business and technical requirements, resiliency needs, and scaling trajectory, so you invest in solutions that fit rather than in whatever is being promoted most heavily.

Ongoing Advisory

A retainer relationship for organizations that want a trusted expert available as their AI infrastructure evolves — for design reviews, vendor evaluations, and escalation support.

Is your AI network holding back your GPUs?

Let's find out. A brief conversation is often enough to identify quick wins.

Schedule a Call