HPC & Custom Hardware
Engineered for the most demanding workloads in AI training, inference, and high-frequency trading. Modulus provides custom silicon, accelerator-class compute, and HPC software tuned to the architecture.
Modulus engineers high-performance systems for the most demanding workloads in capital markets, AI, and scientific computing. Our work has powered platforms for J.P. Morgan Chase, Bank of America, UBS, Charles Schwab, and Nasdaq, alongside organizations in space exploration, healthcare, national security, and defense.
Most problems can be solved at the software layer, in C, C++, or accelerated GPU code targeting CUDA, ROCm, or oneAPI. When those approaches are not fast enough, our hardware and HPC engineers design custom silicon. From high-frequency trading to deep-learning AI, they deliver high-performance systems built on the latest hardware.

FPGA & ASIC engineering
Our flagship machine learning platform pairs server-class CPUs with FPGA accelerators on AMD Versal and Intel Agilex devices, attached over a coherent fabric using OpenCAPI and CXL 3.x. We design accelerator kernels in VHDL and in high-level synthesis flows such as Vitis HLS and Catapult HLS, validate them on the bench, and harden them for production. The platform carries terabytes of DDR5 memory, hosts in-house deep-learning algorithms, and runs complex inference workloads in under 30 nanoseconds, up to 250 times faster than highly optimized C. Multi-port optical interconnects deliver 140 Gbps of full-duplex throughput.
When FPGA performance is not enough, we engineer custom ASIC solutions at 5nm, 3nm, and 2nm process nodes. An ASIC requires a larger up-front investment of time and capital, but it can deliver up to ten times the throughput of an FPGA design. We also build training and inference systems on the latest commercial accelerator silicon, including NVIDIA Blackwell (B200, GB200, and GB300 NVL72), AMD Instinct MI300X, MI325X, and MI350, Intel Gaudi 3, Cerebras WSE-3, and Groq LPU.
- FPGA accelerators on AMD Versal and Intel Agilex
- VHDL alongside Vitis HLS and Catapult HLS flows
- Custom ASIC design at 5nm, 3nm, and 2nm
- Coherent accelerator attach via OpenCAPI and CXL 3.x
- HBM3e on accelerator silicon, DDR5 on host
- 140 Gbps full-duplex optical interconnect
- Sub-30-nanosecond inference latency
Supercomputing software
Supercomputers tackle the hardest simulations, analytics, and AI workloads, from protein-folding prediction and climate modeling to trillion-parameter language model training. To deliver real performance, the software running on them must be tuned to the architecture in play.
Our engineers ship code for HPE Cray EX with Slingshot 11, IBM, Fujitsu, Dell, Eviden, and Lenovo systems, on the same architectures that power Frontier, El Capitan, and Aurora. We optimize across InfiniBand NDR and XDR, Ultra Ethernet, and RoCE v2 fabrics, integrate parallel filesystems including Lustre, DAOS, WekaFS, and VAST Data, and build pipelines on PyTorch, JAX, Megatron-LM, DeepSpeed, vLLM, and Ray.
Proven in mission operations: NASA
NASA Mission Operations needed to port high-performance desktop software, originally written in C, onto tablet devices capable of displaying real-time telemetry and health data streamed from the International Space Station. The target was demanding: half a billion data points per second of ISS health and telemetry, processed and rendered on hardware with limited compute. NASA evaluated numerous solutions before selecting Modulus to build the system.
Modulus designed and patented a method for compressing time-series data into a custom display format, optimized for the CPU's cache and instruction pipeline, and integrated it into our charting library. The tablet solution matched the throughput of the original desktop software while running on a fraction of the compute.
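To give a sense of how display-oriented time-series reduction can work, here is a minimal sketch of min/max decimation, a common general-purpose technique that keeps one (min, max) pair per screen column so that spikes stay visible. It is not the patented Modulus method, whose details are not described here; the function name `minmax_decimate` and its parameters are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def minmax_decimate(samples: List[float], columns: int) -> List[Tuple[float, float]]:
    """Reduce a series to at most `columns` (min, max) pairs, one per display column.

    Illustrative only: a generic decimation technique, not the patented method.
    """
    if columns <= 0:
        raise ValueError("columns must be positive")
    n = len(samples)
    if n == 0:
        return []
    # Never use more columns than samples, so every bucket is non-empty.
    columns = min(columns, n)
    reduced = []
    for c in range(columns):
        # Integer arithmetic splits the series into near-equal contiguous buckets.
        start = c * n // columns
        end = (c + 1) * n // columns
        bucket = samples[start:end]
        reduced.append((min(bucket), max(bucket)))
    return reduced


if __name__ == "__main__":
    # Eight samples drawn into four pixel columns.
    print(minmax_decimate([0, 5, 1, 9, 2, 3, 8, 4], 4))
```

The output size depends on the width of the display rather than the length of the stream, which is why this kind of reduction suits devices with limited compute.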

Deep learning at the core
Modern AI runs on transformer architectures trained at trillion-parameter scale and served at production latency. Modulus engineers build, fine-tune, and deploy these systems across the full stack: GPU clusters connected by NVLink 5 and InfiniBand, distributed training with Megatron-LM and DeepSpeed, and high-throughput inference through vLLM, SGLang, and Triton Inference Server.
Modulus has been immersed in machine learning and high-performance computing for more than three decades. We have contributed to the deep-learning foundations in use today by industry and research pioneers, and we know how to put them to work in practical, high-value ways.
HFT, DMA & quant systems
We build trading systems that combine multidimensional nonlinear modeling, deep neural networks, kernel regression, dynamic programming, genetic algorithms, and other compute-intensive methods to produce dynamic strategies for equities, futures, options, forex, bonds, and digital assets.
Across the buy side and sell side, our engineers have discreetly built systems for some of the largest hedge funds in the industry, covering high-frequency data acquisition, algorithmic execution, smart order routing, market making, and ultra-low-latency network design. Those networks use DPUs such as NVIDIA BlueField-3 and AMD Pensando, FPGA NICs from the AMD Alveo X3 series, and kernel-bypass transports including DPDK and RDMA over Converged Ethernet (RoCE). End-to-end network latency reaches as low as 20 nanoseconds.
- Turnkey broker-dealer platforms with source code
- Matching engines for equities, futures, and forex
- Direct connections to major exchanges including CME and Nasdaq
- WebSocket data broadcasting for web and mobile apps
- In-memory databases and memory-mapped files
- FIX and FAST protocol development
What we engineer
From custom silicon to supercomputing software, our hardware and HPC engineers cover the full stack of ultra-high-performance computing.
FPGA acceleration
Deep-learning accelerators on AMD Versal and Intel Agilex, designed in VHDL and Vitis HLS, delivering sub-30-nanosecond inference for the most latency-sensitive workloads.
Custom ASIC design
Application-specific silicon at 5nm, 3nm, and 2nm process nodes, achieving up to ten times the throughput of an equivalent FPGA when FPGA performance is not enough.
Accelerator-class systems
Training and inference platforms built on NVIDIA Blackwell, AMD Instinct MI300X and MI350, Intel Gaudi 3, Cerebras WSE-3, and Groq LPU, with NVLink, NVSwitch, and InfiniBand fabric.
Supercomputing software
Architecture-tuned software for HPE Cray EX, IBM, Fujitsu, Dell, Eviden, and Lenovo systems, with parallel filesystems including Lustre, DAOS, WekaFS, and VAST Data.
Time-series compression
Patented compression that renders hundreds of millions of data points at extreme speed, proven on NASA tablet telemetry streamed from the ISS.
Ultra-low-latency networks
Twenty-nanosecond fabric built with DPUs, FPGA NICs, and RDMA over Converged Ethernet, engineered for high-frequency trading and tightly coupled AI training.
Real-time AI, powered by HPC
The same HPC engineering also powers Modulus real-time AI systems, grounding Large Language Models in live, verified data with ultra-low latency.
Real-Time AI Truth
How Modulus bridges Generative AI and high-velocity, mission-critical data, with ultra-low latency and reduced hallucinations.
Capabilities & Solutions
Natural-language queries against live data streams for finance, defense, healthcare, cybersecurity, real estate, and logistics.
Hybrid HPC/AI Architecture
Decoupling logic from language: the HPC layer does the math and verification while the LLM handles intent and response.
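The split between a compute layer and a language layer can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the keyword matcher `parse_intent` is a stand-in for an LLM, the `COMPUTE` table is a stand-in for the HPC layer, and none of the names reflect the actual Modulus implementation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Answer:
    metric: str
    value: float
    text: str


# Hypothetical compute layer: deterministic functions over live data.
COMPUTE: Dict[str, Callable[[List[float]], float]] = {
    "average": mean,
    "maximum": max,
    "minimum": min,
}


def parse_intent(question: str) -> str:
    """Stand-in for the language layer: map a question to a known metric."""
    lowered = question.lower()
    for name in COMPUTE:
        if name in lowered:
            return name
    raise ValueError("no supported metric found in question")


def answer(question: str, stream: List[float]) -> Answer:
    metric = parse_intent(question)      # language layer decides what is asked
    value = COMPUTE[metric](stream)      # compute layer produces the number
    # The response wraps the computed value; it never invents the number itself.
    text = f"The {metric} over the last {len(stream)} readings is {value:.2f}."
    return Answer(metric, value, text)


if __name__ == "__main__":
    print(answer("What is the average price right now?", [10.0, 12.0, 14.0]).text)
```

The design point is that the figure in the response always comes from the deterministic layer, so the language layer only chooses which computation to run and how to phrase the result.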
Platforms & tooling
The hardware platforms, frameworks, and protocols our engineers work in every day.
Compute & silicon
- NVIDIA Blackwell B200, GB200, GB300 NVL72
- NVIDIA Hopper H100 and H200
- AMD Instinct MI300X, MI325X, MI350
- Intel Gaudi 3, Cerebras WSE-3, Groq LPU
- AMD Versal and Intel Agilex FPGAs
- Custom ASIC at 5nm, 3nm, and 2nm
Fabric & I/O
- NVLink 5 and NVSwitch for GPU-to-GPU
- InfiniBand NDR (400G) and XDR (800G)
- Ultra Ethernet and HPE Slingshot 11
- OpenCAPI and CXL 3.x coherent attach
- BlueField-3 DPUs and AMD Pensando
- GPUDirect RDMA, RoCE v2, and DPDK
Software & ML stack
- PyTorch, JAX, and TensorFlow
- Megatron-LM and DeepSpeed for distributed training
- vLLM, SGLang, and Triton Inference Server
- Ray, MLflow, and Kubeflow
- Lustre, DAOS, WekaFS, and VAST Data
- Apache Kafka and Spark for data pipelines
The engine for real-time AI truth
Modulus real-time AI grounds Large Language Models in live, verified HPC data, delivering ultra-low-latency responses with up-to-date accuracy while virtually eliminating hallucinations. The same engineering behind our custom hardware powers natural-language queries against high-velocity data streams for finance, defense, healthcare, cybersecurity, and logistics.
Let's build.
Request an instant meeting or schedule a call with our hardware team to discuss your custom hardware project.