Accelerating LocalAI with an Nvidia Tesla M40: GPU acceleration guide
How I built a 24GB AI Server with an old Tesla M40 on an ASUS H87M-PLUS motherboard
Running AI locally is hot, but decent hardware is expensive. As a Digital Problem Solver, I prefer to look at what can be done with things that are already there. My goal? Set up a low-cost, high-performance local AI environment (LocalAI). My weapons?
- An old Intel i7-4770
- An ASUS H87M-PLUS motherboard from 2013
- An enterprise Nvidia Tesla M40 GPU with no less than 24GB of VRAM
Sounds nice, but on paper this project should have failed immediately.
The Big Problem: No "Above 4G Decoding"
To address a monster video card with 12GB or 24GB of VRAM, a system needs Above 4G Decoding (and, on modern boards, Resizable BAR) enabled in the BIOS. This lets the firmware map the card's large memory regions above the 4GB address boundary on the PCIe bus.
After updating my ASUS motherboard to the very latest BIOS version (2108), the painful reality became clear: this option is completely missing from the menu. ASUS never added it for the H87 chipset at the time.
When I booted into Arch Linux and loaded the Nvidia legacy drivers (nvidia-580xx-dkms), I immediately ran into a hard wall: the system refused to reserve memory space for the card. Game over? No way.
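The post does not quote the exact kernel message, so the following is only a sketch of how one might look for it on a similar board. It assumes the failure shows up in the kernel log as a PCI BAR assignment error (wording such as "no space", "failed to assign" or "can't assign", which varies between kernel versions). `bar_failures` is a hypothetical helper name.

```shell
# Filter kernel log text on stdin for PCI BAR assignment failures.
# Assumption: the log wording matches one of the patterns below.
bar_failures() {
    grep -Ei "BAR [0-9]+.*(no space|failed to assign|can't assign)"
}

# Usage on the host (not executed here):
#   sudo dmesg | bar_failures
```

Running `lspci -vv -d 10de:` alongside it lists the memory regions the Nvidia card is asking for, which makes it easier to see why they do not fit below 4GB.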
The Solution: let Linux remap the PCIe bus
Where Windows sticks closely to the resource layout the BIOS hands it, the Linux kernel can be told to call the shots. We can have Linux reassign (*reallocate*) the PCIe memory resources itself and ignore the erroneous restrictions of the old BIOS.
By passing the following parameters to the bootloader in /etc/default/grub, I forced the kernel to take action:
GRUB_CMDLINE_LINUX_DEFAULT="loglevel=3 quiet pci=realloc,nocrs pcie_aspm=off"
- pci=realloc,nocrs: This is the secret sauce. realloc makes Linux reallocate the PCI bridge resources itself so the Tesla card's large memory blocks fit, and nocrs makes it ignore the address windows the BIOS reports through ACPI.
- pcie_aspm=off: Disables PCIe Active State Power Management, so aggressive power saving cannot accidentally put the server card into sleep mode.
After a grub-mkconfig and a reboot, the miracle happened: modprobe nvidia remained silent and **nvidia-smi** proudly showed off a perfectly working Tesla M40!
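A quick sanity check after the reboot is to confirm that the parameters actually reached the kernel. The sketch below assumes a GRUB setup where `sudo grub-mkconfig -o /boot/grub/grub.cfg` regenerates the config (the usual path on Arch, but verify it for your install). The helper names `cmdline_has` and `check_boot_params` are made up for illustration.

```shell
# Succeed if the kernel command line in $1 contains the exact parameter $2.
cmdline_has() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report which of the two parameters from this post are present.
check_boot_params() {
    cmdline=$1
    for param in pci=realloc,nocrs pcie_aspm=off; do
        if cmdline_has "$cmdline" "$param"; then
            echo "ok: $param"
        else
            echo "missing: $param"
        fi
    done
}

# Usage after the reboot (not executed here):
#   check_boot_params "$(cat /proc/cmdline)"
```

If a parameter is reported as missing, the generated grub.cfg was most likely not written to the location the bootloader actually reads.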
Docker hurdles: From Whack-A-Mole to CDI
Now that the host driver was working, the next step was to throw LocalAI into Docker to host models like Gemma-4 locally.
However, the traditional method (--gpus all via the Nvidia Container Toolkit) quickly ended up in a frustrating game of whack-a-mole. Since we were running on the legacy 580xx drivers, Docker kept looking for modern .so libraries that simply don't exist for this card:
error during container init: failed to fulfill mount request: open /usr/lib/libnvidia-cbl.so... no such file or directory
Instead of creating dozens of manual symlinks, I switched to the modern CDI (Container Device Interface) method. With this, the toolkit scans the host and generates a watertight configuration that exactly matches what is actually on the system:
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
After activating CDI in the Docker daemon.json, the GPU could be passed directly and error-free into the compose.yaml:
deploy:
  resources:
    reservations:
      devices:
        - driver: cdi
          device_ids:
            - nvidia.com/gpu=all
          capabilities: [gpu]
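The CDI activation step in daemon.json is not spelled out above, so here is a minimal sketch of what it might look like. It assumes a Docker Engine release that has the `features.cdi` switch (newer releases may already enable CDI by default, in which case this is unnecessary). `cdi_daemon_json` is a hypothetical helper that only prints the fragment and does not touch any file.

```shell
# Print a minimal daemon.json fragment that turns on Docker's CDI support.
# Assumption: the installed Docker Engine understands "features.cdi".
cdi_daemon_json() {
    cat <<'EOF'
{
  "features": {
    "cdi": true
  }
}
EOF
}

# Merge the fragment by hand into /etc/docker/daemon.json, then (not executed here):
#   sudo systemctl restart docker
#   nvidia-ctk cdi list
#   docker run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L
```

Printing rather than writing the file is deliberate: an existing daemon.json often holds other settings that a blind overwrite would destroy.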
CUDA Architecture Mismatch (The final blow)
The container started, the LocalAI UI opened, but when loading gemma-3-1b-it the backend crashed with the message:
CUDA error: no kernel image is available for execution on the device
The culprit: LocalAI's default CUDA 13 images are compiled for modern cards (Compute Capability 7.5 and up). However, my Tesla M40 (Maxwell architecture) only offers Compute Capability 5.2, so the image contains no kernels it can run.
By simply downgrading the Docker image in the Compose file to the more widely supported CUDA 12 variant (localai/localai:latest-gpu-nvidia-cuda-12), the mismatch was resolved.
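This kind of mismatch can be spotted before pulling an image by comparing the card's compute capability with the minimum the image targets. A minimal sketch, assuming a driver whose `nvidia-smi` supports the `compute_cap` query field. `cc_at_least` is a hypothetical helper, and the 7.5 threshold is simply the figure quoted above for the CUDA 13 images.

```shell
# Succeed if compute capability $1 is at least the minimum $2.
# Both arguments are expected in "major.minor" form, e.g. 5.2 or 7.5.
cc_at_least() {
    have_major=${1%%.*}; have_minor=${1#*.}
    min_major=${2%%.*};  min_minor=${2#*.}
    [ "$have_major" -gt "$min_major" ] ||
        { [ "$have_major" -eq "$min_major" ] && [ "$have_minor" -ge "$min_minor" ]; }
}

# Usage on the host (not executed here):
#   cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
#   cc_at_least "$cap" 7.5 || echo "GPU too old for the CUDA 13 image, use the CUDA 12 one"
```

Comparing major and minor as integers avoids the trap of treating the version as a decimal number, where 5.10 would sort below 5.2.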
The Result
The result is a super stable, dirt-cheap AI playground. The LocalAI container now has full, exclusive access to the 24GB of VRAM on the Tesla M40. Thanks to a 3D-printed shroud and a sturdy fan, temperatures remain well within limits, and LLMs run locally, privately and nice and fast on hardware that had actually been written off for a long time. The final compose.yaml looks like this:
services:
  localai:
    image: localai/localai:master-gpu-nvidia-cuda-12
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
      interval: 1m
      timeout: 20m
      retries: 5
    ports:
      - 8510:8080
    environment:
      - DEBUG=false
    volumes:
      - /mnt/storage/localai/models:/models
      - /mnt/storage/localai/backends:/backends
      - /mnt/storage/localai/backends:/usr/share/localai/backends
      - /mnt/storage/localai/images:/tmp/generated/images
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids:
                - nvidia.com/gpu=all
              capabilities: [gpu]
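To check from the host that the stack came up, the same /readyz endpoint the healthcheck uses can be polled through the published port 8510. This is only a sketch: `wait_ready` is a hypothetical helper, and the default attempt count and delay are arbitrary choices, not values from the setup above.

```shell
# Poll a readiness URL until it answers with an HTTP success code
# or the given number of attempts runs out.
wait_ready() {
    url=$1; attempts=${2:-30}; delay=${3:-10}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if curl -fsS -o /dev/null "$url"; then
            echo "ready"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "not ready after $attempts attempts" >&2
    return 1
}

# Usage on the host (not executed here):
#   docker compose up -d && wait_ready http://localhost:8510/readyz
```

Large models can take a while to load on first start, so a generous attempt count is more useful here than a short timeout.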
Conclusion: Never let a missing button in your BIOS or an error message in Docker stop you. With the right kernel parameters and the modern CDI standard, you can transform 2013 hardware into a modern AI powerhouse.
Do you also have an 'impossible' hardware challenge in your home lab or company? Send me a message.