Zero Trust in Distributed AI: Implementing NVIDIA Confidential Computing

How DataQI helped Manifold Labs guarantee security for high-value AI workloads on distributed clusters using NVIDIA Confidential Computing.

ZERO TRUST IN A DISTRIBUTED WORLD

The Challenge:

In the rapidly evolving sector of distributed AI, Manifold Labs faced a critical infrastructure paradox: their business model relied on running high-value AI workloads on distributed clusters, but they did not physically own the servers.

Standard security measures, which encrypt data at rest (storage) and in transit (network), were insufficient. The moment data was loaded into memory for processing, it became vulnerable. For Manifold Labs, this "clear text" gap was a showstopper. To maintain user trust, they needed to guarantee that private workloads could not be accessed or tampered with by anyone, including the owners of the host servers or malicious administrators.

This wasn’t just a permission issue; it was a fundamental hardware challenge. They required a solution where the entire compute stack (CPU, RAM, GPU, VRAM, and PCIe bus communications) was cryptographically protected from the host OS.

The Response:

ENGINEERING THE IMPOSSIBLE

DataQI identified NVIDIA Confidential Computing as the only viable solution to bridge this trust gap. By leveraging hardware-based Trusted Execution Environments (TEEs), we could isolate the memory and GPU execution state from the host. However, implementing this on bleeding-edge hardware required navigating a labyrinth of technical complexities.

Phase 1: VALIDATING THE ATTACK VECTORS

We began by validating the architecture in a Proof of Concept (PoC) environment.

Working closely with NVIDIA solution architects, we configured a Confidential Virtual Machine (CVM) on an AMD-based server using AMD SEV-SNP with KVM.

This was far from "plug-and-play"; it demanded deep kernel-level engineering, including patching the Linux kernel on Ubuntu to support specific confidential computing features.

This phase successfully verified that we could perform GPU attestation as described in NVIDIA’s documentation.
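One subtlety of attestation worth making concrete is freshness: the verifier must bind each report to the current request, or an attacker could replay an old report from a previously trusted state. The Go sketch below shows only that nonce-binding step; the `AttestationReport` type is a hypothetical simplification, and a real H100 report is a signed structure that is also verified against NVIDIA's certificate chain and reference measurements.

```go
package main

import (
	"bytes"
	"crypto/rand"
	"fmt"
)

// AttestationReport is a hypothetical, simplified stand-in for the signed
// evidence a GPU returns; real verification also checks signatures and
// measurements, which this sketch omits.
type AttestationReport struct {
	Nonce        []byte
	Measurements [][]byte
}

// freshNonce produces the caller-chosen nonce that binds a report to this
// attestation round, so a recorded old report cannot be replayed.
func freshNonce() ([]byte, error) {
	n := make([]byte, 32)
	_, err := rand.Read(n)
	return n, err
}

// nonceMatches confirms the responder echoed our nonce back in its report.
func nonceMatches(report AttestationReport, nonce []byte) bool {
	return bytes.Equal(report.Nonce, nonce)
}

func main() {
	nonce, err := freshNonce()
	if err != nil {
		panic(err)
	}
	// In the real flow the nonce goes to the attestation tooling, which
	// returns a signed report; here we fake the round trip locally.
	report := AttestationReport{Nonce: nonce}
	fmt.Println("report bound to this round:", nonceMatches(report, nonce))
}
```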

Phase 2: SCALING TO PRODUCTION (THE H100 HURDLE)

Moving to production meant scaling up to beasts of computation: four Supermicro GPU SuperServer SYS-821GE-TNHR units, each equipped with eight NVIDIA H100 GPUs in the SXM form factor.

Here, we encountered the project's toughest technical hurdle. When enabling Intel TDX (Trust Domain Extensions) on this specific architecture, the servers failed to boot. This was a critical roadblock involving the interplay between the motherboard, CPU, and GPU security protocols.

DataQI engineers facilitated a deep-dive collaboration between engineers from Intel and Supermicro to troubleshoot the stack. We identified that the issue lay in firmware incompatibilities. By isolating and installing the correct firmware versions for both the BIOS and the H100 GPUs, we successfully enabled the host servers for Confidential Computing.

Phase 3: AUTOMATING TRUST WITH GO

Manual validation of such a complex stack is neither scalable nor secure. To solve this, DataQI developed a custom Go-based application that acts as the gatekeeper for the distributed cluster.

This tool performs a three-step security handshake:

  • Validation: It runs system-level checks and securely transmits a host validation report to the client.
  • Deployment: Only if validation succeeds does the system download and launch the pre-built Confidential Virtual Machine.
  • Attestation: Inside the CVM, a secure service verifies the integrity of the VM itself, while GPU attestation confirms the trusted state of each H100 card against NVIDIA services.

The Result:

TOTAL CONFIDENCE IN EVERY BYTE

By rigorously engineering the stack from the firmware up, DataQI delivered a fully verified, encrypted AI environment.

  • Cryptographic Isolation: Workloads now run inside encrypted CVMs where no host operator can observe the data.
  • Verified Integrity: The client can verify the exact state of the CPU and GPU before processing a single byte of data.

  • Market Readiness: Manifold Labs can now deploy proprietary AI models to distributed clusters with total confidence, unlocking a new tier of secure, distributed computing.

WHY DATAQI?

This project wasn't about simply installing software; it was about orchestrating a solution across vendors (NVIDIA, Intel, Supermicro) and solving low-level firmware conflicts that had no documented fix.

DataQI’s ability to combine high-level software engineering for automation with bare-metal engineering makes us the ideal partner for the most demanding infrastructure challenges.