In the world of distributed AI, trust is the only currency that matters.
When a client in the distributed AI computing sector approached DataQI, they faced a critical challenge: how do you guarantee security when your workloads run on servers you don’t physically own?
The Challenge: Zero Trust in a Distributed Environment
The industry standard for securing high-performance compute environments, especially for LLM and AI workloads, is now Confidential Computing.
Standard encryption protects data at rest (storage) and in transit (network). However, the moment data is loaded into memory for processing (data in use), it is typically exposed in plaintext. For our client, this gap was unacceptable. They needed a solution where the CPU, RAM, GPU, VRAM, NVLink, and PCIe bus communications were cryptographically protected, preventing unauthorized observation even from the host OS or a malicious administrator.
We identified NVIDIA Confidential Computing as the appropriate solution. By leveraging hardware-based Trusted Execution Environments (TEEs), we could ensure that the memory, CPU state, and GPU execution remained isolated from the host.
Phase 1: Proof of Concept with AMD SEV-SNP
Because of the complexity of the stack, we needed to validate every attack vector. Our journey began with deep technical sessions with NVIDIA solution architects to validate our approach to GPU attestation and encryption layers.
We started by configuring a Confidential Virtual Machine (CVM) on an AMD-based server using AMD SEV-SNP with KVM. This was not a "plug-and-play" operation; it required significant updates and patching of the Linux kernel on Ubuntu to a specific version that supported the necessary confidential computing features.
This phase confirmed that we could successfully configure a CVM and verify GPU attestation against NVIDIA’s documentation, giving us the green light to move to production hardware.
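A lightweight pre-check before full cryptographic attestation is querying the GPUs’ confidential-computing ready state. A minimal Go sketch follows, assuming an NVIDIA driver whose nvidia-smi exposes the conf-compute subcommand; the exact output format is an assumption and may differ across driver releases.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// ccReady interprets the ready-state line reported by
// `nvidia-smi conf-compute -grs`. The wording is an assumption:
// we treat any "ready" that is not "not ready" as affirmative.
func ccReady(output string) bool {
	s := strings.ToLower(output)
	return strings.Contains(s, "ready") && !strings.Contains(s, "not ready")
}

func main() {
	// Query the GPUs' confidential-compute ready state; requires a
	// driver new enough to support the conf-compute subcommand.
	out, err := exec.Command("nvidia-smi", "conf-compute", "-grs").CombinedOutput()
	if err != nil {
		fmt.Println("conf-compute query failed:", err)
		return
	}
	fmt.Println("GPU CC ready:", ccReady(string(out)))
}
```

This is only a sanity check; the cryptographic verification described later is still performed against NVIDIA's attestation services.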
Phase 2: Scaling to Supermicro H100 Clusters
The production environment was significantly more powerful. We moved to configuring four Supermicro GPU SuperServer SYS-821GE-TNHR units. These are beasts of computation, designed for LLM training and inference, each equipped with eight NVIDIA H100 SXM GPUs interconnected via NVLink.
Enabling Confidential Computing on this specific architecture presented unique hurdles. We encountered boot issues when enabling Intel TDX (Trust Domain Extensions).
DataQI worked closely with engineers from both Intel and Supermicro to troubleshoot the problem. We discovered the issue lay in the firmware; by installing the correct firmware versions for both the BIOS and the GPUs, we successfully enabled the host server for Confidential Computing.
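Once the firmware was corrected, the host's readiness can be probed from userspace. A minimal sketch in Go, assuming the conventional kvm_amd/kvm_intel module parameter files; the exact sysfs paths vary by kernel version and are assumptions here:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// paramEnabled reports whether a kernel module parameter file holds
// an affirmative value ("Y" or "1"), the convention used by the
// KVM confidential-computing toggles.
func paramEnabled(raw string) bool {
	v := strings.TrimSpace(raw)
	return v == "Y" || v == "y" || v == "1"
}

func main() {
	// Candidate parameter files; exact paths depend on kernel version
	// and on whether the host is AMD (SEV-SNP) or Intel (TDX).
	params := map[string]string{
		"SEV-SNP": "/sys/module/kvm_amd/parameters/sev_snp",
		"TDX":     "/sys/module/kvm_intel/parameters/tdx",
	}
	for name, path := range params {
		raw, err := os.ReadFile(path)
		if err != nil {
			fmt.Printf("%s: parameter not exposed (%v)\n", name, err)
			continue
		}
		fmt.Printf("%s enabled: %v\n", name, paramEnabled(string(raw)))
	}
}
```

Checks like this caught the firmware mismatch class of problem early on subsequent nodes, before any guest was launched.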
Automating Trust with Go
Validating a complex hardware stack manually is neither scalable nor secure. To streamline this, DataQI developed a custom Go-based application to perform system-level checks on all components required for Confidential Computing.
This tool acts as the gatekeeper for the distributed cluster:
- Validation: It generates a detailed host validation report and securely transmits it to the client’s control servers.
- Deployment: If, and only if, validation succeeds, the system automatically downloads and launches a pre-built Confidential Virtual Machine.
- Attestation: Inside the CVM, a secure service provides attestation endpoints. This allows the client to verify the integrity of the CVM itself, while GPU attestation is performed against NVIDIA services to confirm the trusted state of each H100 GPU.
The Result: Verified, Encrypted AI
By rigorously implementing and testing these layers, the client gained the ability to run workloads inside encrypted Confidential Virtual Machines backed by verified GPU attestation.
This architecture ensures that no host operator or external actor can access or tamper with customer data. The client can now deploy AI workloads to their distributed cluster with total confidence, knowing that the environment is cryptographically isolated and verified before a single byte of data is processed.
Next Steps
Are you looking to implement Confidential Computing for your AI infrastructure? Get in touch to see how we can help secure your compute stack.

