Top Enterprise Servers for LLM Inference with NVIDIA A100 GPUs: Dell and HPE Solutions

Before diving into the details, it’s worth noting that the most powerful enterprise datacenter servers from Dell and HPE capable of running large language models such as Nemotron 70B and Llama 3.1 70B all build on NVIDIA’s A100 GPUs. These high-performance computing platforms deliver the compute needed for inference on billion-parameter models while maintaining reasonable response times in production environments.

Dell PowerEdge XE8545: AI Infrastructure Powerhouse

The Dell PowerEdge XE8545 represents Dell’s flagship offering for AI workloads, including LLM inference. This 4U rack server combines AMD EPYC processors with NVIDIA A100 Tensor Core GPUs in a configuration optimized for machine learning, high-performance computing, and GPU virtualization.

Architecture and Specifications

The XE8545 features a dual-socket design with AMD EPYC processors (up to 128 cores across both sockets) alongside four NVIDIA A100 Tensor Core GPUs in the SXM4 form factor. The server pairs PCIe Gen 4.0 connectivity with NVMe SSDs to eliminate data-throughput bottlenecks when working with large datasets.
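As a rough illustration of why Gen 4.0 links matter at this scale, the sketch below estimates how long it takes to stream a 4-bit-quantized 70B-parameter checkpoint over a single PCIe 4.0 x16 link. The ~32 GB/s figure is the nominal unidirectional bandwidth; real transfers land somewhat below it.

```python
# Back-of-envelope: time to stream model weights over PCIe Gen 4.0 x16.
# Assumes nominal ~32 GB/s unidirectional bandwidth; real-world throughput
# is lower due to protocol overhead.

PCIE4_X16_GBPS = 32.0               # nominal GB/s for a Gen 4.0 x16 link
MODEL_SIZE_GB = 70e9 * 0.5 / 1e9    # 70B params at 4 bits (0.5 bytes each) = 35 GB

seconds = MODEL_SIZE_GB / PCIE4_X16_GBPS
print(f"~{seconds:.1f} s to load {MODEL_SIZE_GB:.0f} GB at {PCIE4_X16_GBPS:.0f} GB/s")
```

Roughly a second per hop for quantized weights, which is why the bus generation and NVMe storage both matter for model load times.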

The system offers two GPU configuration options:

  • Four NVIDIA A100 SXM4 GPUs with 80GB memory per GPU
  • Four NVIDIA A100 SXM4 GPUs with 40GB memory per GPU

Performance Benchmarks

For large language model inference, a single XE8545 with 4× NVIDIA A100 SXM 80GB GPUs delivers approximately 621.4 output tokens per second with a P95 token latency of 50.78ms. This configuration can support up to 124 simultaneous interactive LLM sessions.
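A quick sanity check on those two figures, assuming the aggregate throughput divides evenly across sessions (a simplification; real inference schedulers batch requests unevenly):

```python
# Per-session throughput implied by the quoted aggregate benchmark figures.
AGGREGATE_TOKENS_PER_SEC = 621.4   # 4x A100 SXM 80GB, from the benchmark above
CONCURRENT_SESSIONS = 124

per_session = AGGREGATE_TOKENS_PER_SEC / CONCURRENT_SESSIONS
print(f"~{per_session:.1f} tokens/sec per session")
```

That works out to about 5 tokens per second per session, which is in the range of comfortable human reading speed for interactive use.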

When running Llama 3 70B specifically, benchmark data indicates that:

  • Word-level throughput scales from roughly 414 words/sec (76ms latency) on a single node with 4× A100 GPUs to 782 words/sec (100ms latency) across two nodes with 8× A100 GPUs.
  • With 4-bit quantization, Llama 3 70B runs at about 24.09 tokens/sec on a single A100.
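The 4-bit figure follows from simple memory arithmetic. The sketch below (my own estimate, not from the benchmark data) shows why a 70B-parameter model fits on a single A100 only when quantized; note it counts weights only, ignoring KV-cache and activation overhead:

```python
# Approximate weight memory for a 70B-parameter model at common precisions.
# Weights only -- KV cache and activations need additional headroom.
PARAMS = 70e9

for name, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    gb = PARAMS * bytes_per_param / 1e9
    fits = gb <= 80
    print(f"{name}: ~{gb:.0f} GB of weights (fits one 80GB A100: {fits})")
```

At fp16 the weights alone need ~140 GB (two or more GPUs), while int4 brings them down to ~35 GB, squeezing onto a single 40GB or 80GB card.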

Pricing

While Dell does not publish list pricing for a fully configured XE8545, component pricing points to a high-end enterprise solution: the NVIDIA A100 80GB GPUs alone cost approximately $33,000 each, and a full system with four 80GB GPUs would likely land in the $150,000–$200,000 range.

NVIDIA DGX A100 P3687: Purpose-Built AI System

The DGX A100 (part number P3687) is NVIDIA’s purpose-built AI computing platform for deep learning and LLM workloads. The system ships with NVIDIA’s software stack pre-integrated, providing immediate productivity for AI development.

Architecture and Specifications

The DGX A100 features eight NVIDIA A100 Tensor Core GPUs and two 2nd Gen AMD EPYC processors in a tightly integrated platform, with configurations offering 40GB or 80GB of memory per GPU. The system is designed to support the full AI lifecycle, from model development through optimization and deployment.

Performance Benchmarks

Performance testing shows that a DGX A100 with 8× NVIDIA A100 SXM 80GB GPUs can achieve approximately 1,172.63 tokens per second with a P95 latency of 66.41ms. When scaled to 12 GPUs across multiple nodes, throughput increases to 1,551.94 tokens per second while maintaining similar latency.
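Those two data points imply healthy but sub-linear scaling. A quick calculation, taking the 8-GPU run as the baseline:

```python
# Scaling efficiency from 8 to 12 GPUs, using the quoted throughputs.
tps_8 = 1172.63    # tokens/sec on 8x A100 SXM 80GB
tps_12 = 1551.94   # tokens/sec on 12 GPUs across nodes

speedup = tps_12 / tps_8      # measured speedup
ideal = 12 / 8                # ideal linear speedup (1.5x)
efficiency = speedup / ideal
print(f"speedup {speedup:.2f}x vs ideal {ideal:.2f}x -> {efficiency:.0%} efficiency")
```

That is roughly 88% of linear scaling, a reasonable result given the cross-node communication the 12-GPU configuration introduces.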

Pricing

The DGX A100 P3687 with 8× 40GB GPUs is priced at approximately $255,012.16. Configurations with 8× 80GB GPUs range from $199,357.13 to $307,367.09, depending on support options and specific configuration.

HPE ProLiant DL380 Gen10 Plus: Versatile Server Platform

The HPE ProLiant DL380 Gen10 Plus offers a balance of expandability and scalability for AI inference workloads. Built on 3rd Gen Intel Xeon Scalable Processors with PCIe Gen4 capabilities, it supports NVIDIA A100 GPUs in PCIe form factor and is designed for flexible, cost-effective AI deployments.

Architecture and Specifications

The DL380 Gen10 Plus is configured with 3rd Gen Intel Xeon Scalable Processors and supports up to three NVIDIA A100 80GB PCIe GPUs. It offers improved data-transfer rates and higher networking speeds, and ships with recommended cooling and power configurations for GPU workloads.

Performance Benchmarks

Real-world testing indicates that a single A100 40GB GPU can run Llama 3 70B at approximately 24.09 tokens per second with 4-bit quantization. Scaling to two or three GPUs lets the DL380 Gen10 Plus serve multiple concurrent inference sessions with near-linear performance gains.

Pricing

The base HPE ProLiant DL380 Gen10 Plus server starts at approximately $3,099.00 (without GPUs). Each NVIDIA A100 PCIe GPU for HPE costs around $29,112.00 for the 40GB version or about $33,000 for the 80GB version. A fully configured system with 3× A100 80GB GPUs is typically priced in the $100,000–$150,000 range, offering a cost-effective alternative to Dell’s high-end configurations.
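To put the price ranges in context, the figures quoted earlier allow a back-of-envelope cost-per-throughput comparison. This is my own arithmetic on the article’s midpoint prices and benchmark throughputs, not vendor TCO data; the DL380 is omitted because no aggregate multi-GPU tokens/sec figure is quoted for it.

```python
# Rough price per unit of throughput, from the figures in this article.
# Prices are midpoints of the quoted ranges; throughput is aggregate tokens/sec.
systems = {
    "XE8545 (4x A100 80GB)":   (175_000, 621.40),   # midpoint of $150k-$200k
    "DGX A100 (8x A100 80GB)": (253_362, 1172.63),  # midpoint of $199k-$307k
}

for name, (price, tps) in systems.items():
    print(f"{name}: ~${price / tps:,.0f} per token/sec of capacity")
```

By this crude measure the 8-GPU DGX A100 delivers somewhat more throughput per dollar than the 4-GPU XE8545, at the cost of a higher absolute price.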

Conclusion

These enterprise datacenter servers provide the computational power needed to run sophisticated LLMs such as Nemotron 70B and Llama 3.1 70B. The NVIDIA DGX A100 offers the highest performance with its 8-GPU configuration, the Dell PowerEdge XE8545 provides a balanced solution with 4 GPUs, and the HPE ProLiant DL380 Gen10 Plus delivers a flexible, cost-effective option with up to 3 GPUs. Organizations deploying large language models in production now have state-of-the-art options to choose from, with throughput scaling near-linearly as GPUs are added.

About the Author

Rejith Krishnan

Rejith Krishnan is the Founder and CEO of lowtouch.ai, a platform dedicated to empowering enterprises with private, no-code AI agents. With expertise in Site Reliability Engineering (SRE), Kubernetes, and AI systems architecture, he is passionate about simplifying the adoption of AI-driven automation to transform business operations.

Rejith specializes in deploying Large Language Models (LLMs) and building intelligent agents that automate workflows, enhance customer experiences, and optimize IT processes, all while ensuring data privacy and security. His mission is to help businesses unlock the full potential of enterprise AI with seamless, scalable, and secure solutions that fit their unique needs.

About lowtouch.ai

lowtouch.ai delivers private, no-code AI agents that integrate seamlessly with your existing systems. Our platform simplifies automation and ensures data privacy while accelerating your digital transformation. Effortless AI, optimized for your enterprise.
