CUDA

4 items

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/30/2026

A Hackable ML Compiler Stack in 5,000 Lines of Python [P]

The author built a simplified, hackable ML compiler stack in 5,000 lines of Python that emits raw CUDA, aiming to provide an easy-to-follow reference without the complexity of existing frameworks. It lowers small models like TinyLlama and Qwen2.5-7B through six Intermediate Representations, focusing on clarity over performance.

CUDA ML compiler compiler design Python

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/13/2026

TurboOCR: 270–1200 img/s OCR with Paddle + TensorRT (C++/CUDA, FP16) [P]

TurboOCR achieves 270–1200 img/s OCR by optimizing PaddleOCR with C++/CUDA, FP16 TensorRT, fused kernels, and batched processing, addressing the performance bottlenecks of VLM-based approaches. This solution drastically improves throughput for large-scale document processing and is suitable for real-time RAG applications.

CUDA Performance optimization TensorRT C++

ARTICLE · DEV.to AI · 5/3/2026

I wrote a custom CUDA inference engine to run Qwen3.5-27B on $130 mining cards

A developer wrote a custom CUDA inference engine to run the Qwen3.5-27B large language model on repurposed mining graphics cards costing about $130 each. The project shows that low-level optimization can make large models usable on inexpensive hardware.

CUDA Optimization inference hardware

ARTICLE · DEV.to AI · 4/9/2026

I Made a Single CUDA Kernel Speak: Streaming Qwen3-TTS at 50ms Latency on an RTX 5090

The author details the optimization of a Qwen3-TTS system, which cut latency from 35 seconds to 50 milliseconds TTFC and reached a real-time factor (RTF) of 0.17 on an RTX 5090. This was achieved by changing just three lines of code in a CUDA kernel, making real-time speech synthesis viable for natural conversations.

CUDA Hardware AI Optimization Low Latency