Zero Trust in Distributed AI: Implementing NVIDIA Confidential Computing

How DataQI helped Manifold Labs guarantee security for high-value AI workloads on distributed clusters using NVIDIA Confidential Computing.

ZERO TRUST IN A DISTRIBUTED WORLD

The Challenge:

In the rapidly evolving sector of distributed AI, Manifold Labs faced a critical infrastructure paradox: their business model relied on running high-value AI workloads on distributed clusters, but they did not physically own the servers.

Standard security measures, which encrypt data at rest (storage) and in transit (network), were insufficient. The moment data was loaded into memory for processing, it became vulnerable. For Manifold Labs, this "clear text" gap was a showstopper. To maintain user trust, they needed to guarantee that private workloads could not be accessed or tampered with by anyone, including the owners of the host servers or malicious administrators.

This wasn’t just a permission issue; it was a fundamental hardware challenge. They required a solution where the entire compute stack (CPU, RAM, GPU, VRAM, and PCIe bus communications) was cryptographically protected from the host OS.

The Response:

ENGINEERING THE IMPOSSIBLE

DataQI identified NVIDIA Confidential Computing as the only viable solution to bridge this trust gap. By leveraging hardware-based Trusted Execution Environments (TEEs), we could isolate the memory and GPU execution state from the host. However, implementing this on bleeding-edge hardware required navigating a labyrinth of technical complexities.

Phase 1: VALIDATING THE ATTACK VECTORS

We began by validating the architecture in a Proof of Concept (PoC) environment.

Working closely with NVIDIA solution architects, we configured a Confidential Virtual Machine (CVM) on an AMD-based server using AMD SEV-SNP with KVM.

This was far from "plug-and-play"; it demanded deep kernel-level engineering, including patching the Linux kernel on Ubuntu to support specific confidential computing features.

This phase successfully verified that we could perform GPU attestation as described in NVIDIA’s documentation.
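One subtlety of attestation worth making concrete is freshness: the verifier must bind each report to the current request, or an attacker could replay an old report from a previously trusted state. The Go sketch below shows only that nonce-binding step; the `AttestationReport` type is a hypothetical simplification, and a real H100 report is a signed structure that is also verified against NVIDIA's certificate chain and reference measurements.

```go
package main

import (
	"bytes"
	"crypto/rand"
	"fmt"
)

// AttestationReport is a hypothetical, simplified stand-in for the signed
// evidence a GPU returns; real verification also checks signatures and
// measurements, which this sketch omits.
type AttestationReport struct {
	Nonce        []byte
	Measurements [][]byte
}

// freshNonce produces the caller-chosen nonce that binds a report to this
// attestation round, so a recorded old report cannot be replayed.
func freshNonce() ([]byte, error) {
	n := make([]byte, 32)
	_, err := rand.Read(n)
	return n, err
}

// nonceMatches confirms the responder echoed our nonce back in its report.
func nonceMatches(report AttestationReport, nonce []byte) bool {
	return bytes.Equal(report.Nonce, nonce)
}

func main() {
	nonce, err := freshNonce()
	if err != nil {
		panic(err)
	}
	// In the real flow the nonce goes to the attestation tooling, which
	// returns a signed report; here we fake the round trip locally.
	report := AttestationReport{Nonce: nonce}
	fmt.Println("report bound to this round:", nonceMatches(report, nonce))
}
```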

Phase 2: SCALING TO PRODUCTION (THE H100 HURDLE)

Moving to production meant scaling up to beasts of computation: four Supermicro GPU SuperServer SYS-821GE-TNHR units, each equipped with eight NVIDIA H100 GPUs in the SXM form factor.

Here, we encountered the project's toughest technical hurdle. When enabling Intel TDX (Trust Domain Extensions) on this specific architecture, the servers failed to boot. This was a critical roadblock involving the interplay between the motherboard, CPU, and GPU security protocols.

DataQI engineers facilitated a deep-dive collaboration between engineers from Intel and Supermicro to troubleshoot the stack. We identified that the issue lay in firmware incompatibilities. By isolating and installing the correct firmware versions for both the BIOS and the H100 GPUs, we successfully enabled the host servers for Confidential Computing.

Phase 3: AUTOMATING TRUST WITH GO

Manual validation of such a complex stack is neither scalable nor secure. To solve this, DataQI developed a custom Go-based application that acts as the gatekeeper for the distributed cluster.

This tool performs a three-step security handshake:

  • Validation: It runs system-level checks and securely transmits a host validation report to the client.
  • Deployment: Only if validation succeeds does the system download and launch the pre-built Confidential Virtual Machine.
  • Attestation: Inside the CVM, a secure service verifies the integrity of the VM itself, while GPU attestation confirms the trusted state of each H100 card against NVIDIA services.

The Result:

TOTAL CONFIDENCE IN EVERY BYTE

By rigorously engineering the stack from the firmware up, DataQI delivered a fully verified, encrypted AI environment.

  • Cryptographic Isolation: Workloads now run inside encrypted CVMs where no host operator can observe the data.
  • Verified Integrity: The client can verify the exact state of the CPU and GPU before processing a single byte of data.

  • Market Readiness: Manifold Labs can now deploy proprietary AI models to distributed clusters with total confidence, unlocking a new tier of secure, distributed computing.

WHY DATAQI?

This project wasn't about simply installing software; it was about orchestrating a solution across vendors (NVIDIA, Intel, Supermicro) and solving low-level firmware conflicts that had no documented fix.

DataQI’s ability to combine high-level software engineering for automation with bare-metal engineering makes us the ideal partner for the most demanding infrastructure challenges.