Deep dives into AI infrastructure, GPU performance benchmarks, customer stories, and the latest from the NeuralVane platform.
A deep dive into the network architecture behind NeuralVane's multi-region GPU clusters. We cover topology design, congestion control, and how we achieve near-linear scaling for distributed training jobs.
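For readers who want the headline metric up front: "near-linear scaling" is usually quantified as scaling efficiency, the ratio of observed speedup to ideal speedup as GPU count grows. A minimal sketch of that calculation is below; the throughput figures in it are illustrative placeholders, not measurements from our clusters.

```python
# Scaling efficiency: observed speedup divided by ideal (linear) speedup.
# The throughput numbers below are illustrative placeholders, not
# measurements from NeuralVane clusters.

def scaling_efficiency(base_throughput: float, base_gpus: int,
                       scaled_throughput: float, scaled_gpus: int) -> float:
    """Return efficiency in [0, 1]; 1.0 means perfectly linear scaling."""
    ideal = base_throughput * (scaled_gpus / base_gpus)
    return scaled_throughput / ideal

# Example: 8 GPUs at 1,200 samples/s vs. 256 GPUs at 36,000 samples/s.
eff = scaling_efficiency(1200.0, 8, 36000.0, 256)
print(f"Scaling efficiency: {eff:.1%}")  # ~93.8%
```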
We ran identical LLM training workloads across three GPU generations on our platform. The results reveal surprising differences in memory bandwidth, interconnect utilization, and cost-per-token economics.
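If you want to reproduce the cost-per-token math for your own runs, the formula is simply total GPU spend divided by tokens processed. A quick sketch with hypothetical rates and counts (the per-generation numbers are in the post):

```python
# Cost per token = total GPU spend / tokens processed during training.
# All rates and counts below are hypothetical placeholders.

def cost_per_million_tokens(gpu_hourly_rate: float, num_gpus: int,
                            hours: float, tokens_processed: float) -> float:
    total_cost = gpu_hourly_rate * num_gpus * hours
    return total_cost / (tokens_processed / 1e6)

# Example: 64 GPUs at $2.50/GPU-hour for 100 hours, processing 2e11 tokens.
print(f"${cost_per_million_tokens(2.50, 64, 100, 2e11):.4f} per 1M tokens")
```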
Meridian AI was spending $2.4M/month on GPU compute with a major cloud provider. After migrating to NeuralVane, they cut costs by 60% while improving training throughput by 3.2x. Here's their story.
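To make those headline numbers concrete: a 60% reduction on $2.4M/month works out to roughly $0.96M/month, and combined with the 3.2x throughput gain, the effective cost per unit of training work drops by a factor of eight. A quick sanity check of the arithmetic:

```python
# Sanity-checking the Meridian AI figures quoted above.
old_monthly_cost = 2.4e6                             # $/month before migration
new_monthly_cost = old_monthly_cost * (1 - 0.60)     # 60% reduction
throughput_gain = 3.2                                # relative training throughput

print(f"New monthly spend: ${new_monthly_cost:,.0f}")            # $960,000
# Cost per unit of training work = (relative cost) / (relative throughput)
relative_unit_cost = (new_monthly_cost / old_monthly_cost) / throughput_gain
print(f"Effective cost per unit of work: {relative_unit_cost:.3f}x "
      f"({1 / relative_unit_cost:.0f}x cheaper)")                # 0.125x, 8x
```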
Today we're launching NeuralVane Inference Engine — a fully managed serving platform optimized for LLMs and diffusion models. Automatic batching, speculative decoding, and global edge deployment built in.
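To illustrate what automatic batching buys you: incoming requests are buffered briefly and flushed as a single batch when either a size threshold or a latency deadline is hit, trading a few milliseconds of queueing for much higher GPU utilization. The sketch below shows the general technique only; it is not the Inference Engine's actual scheduler, and the thresholds are made-up defaults.

```python
# A toy dynamic-batching loop: collect requests until the batch is full
# or a deadline expires, then run them together. Conceptual only; this
# is not NeuralVane Inference Engine code.
import queue
import time

MAX_BATCH = 8        # flush when this many requests are waiting
MAX_WAIT_S = 0.010   # ...or when the oldest request has waited 10 ms

def batching_loop(requests: "queue.Queue", run_batch) -> None:
    while True:
        batch = [requests.get()]          # block until the first request arrives
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                batch.append(requests.get(timeout=timeout))
            except queue.Empty:
                break
        run_batch(batch)                  # one forward pass for the whole batch
```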
GPUs fail. Nodes go down. Networks partition. In this post, we explain how NeuralVane's checkpoint-and-resume architecture ensures your training jobs survive hardware failures without losing progress.
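The core pattern is simple even though the production version has to handle distributed state: periodically write the model, optimizer, and step counter atomically, and on restart load the most recent checkpoint and continue. A minimal single-node PyTorch sketch of that pattern follows; it is an illustration of the general technique, not NeuralVane's implementation, and the checkpoint path is hypothetical.

```python
# Minimal checkpoint-and-resume pattern (single node, PyTorch).
# Production systems add distributed coordination, async uploads, and
# checkpoint validation; this sketch shows only the core idea.
import os
import torch

CKPT_PATH = "checkpoint.pt"  # hypothetical path

def save_checkpoint(model, optimizer, step: int) -> None:
    tmp = CKPT_PATH + ".tmp"
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, tmp)
    os.replace(tmp, CKPT_PATH)  # atomic rename: never a half-written file

def load_checkpoint(model, optimizer) -> int:
    """Restore state if a checkpoint exists; return the step to resume at."""
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1
```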
Training large models requires feeding data at extraordinary rates. We benchmarked our distributed storage layer against S3, GCS, and local NVMe to show how NeuralVane eliminates I/O bottlenecks.
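If you want a rough baseline for your own storage before comparing against the numbers in the post, measuring sequential read throughput takes only a few lines. A toy harness is below (local files only, with a hypothetical example path); the post's benchmark covers S3, GCS, and NVMe with a far more rigorous methodology.

```python
# Toy sequential-read throughput measurement for a local file.
# The full benchmark controls for cache state and uses parallel readers
# against object stores; this only gives a quick local baseline.
import time

def read_throughput_gbps(path: str, chunk_bytes: int = 8 * 1024 * 1024) -> float:
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_bytes):
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return (total / 1e9) / elapsed  # GB/s

# Example (hypothetical path):
# print(f"{read_throughput_gbps('/data/shard-00000.tar'):.2f} GB/s")
```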
Get monthly deep dives on AI infrastructure, performance tips, and product updates delivered to your inbox.