INT3 compression+fused metal kernels [R]
A solo founder developed INT3 model compression and a 2-bit KV cache with custom fused Metal kernels for Mac (M-series). Qwen 7B is available in preview, and further optimizations and GPU support are planned.
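The post's own INT3 format and Metal kernels are not described here, so the following is only a minimal sketch of what 3-bit weight quantization can look like in general. It assumes a simple group-wise min/max scheme; the function names and the `group_size` parameter are hypothetical and not taken from the project.

```python
def quantize_int3(weights, group_size=4):
    """Map floats to 3-bit codes (0..7), one scale and offset per group.

    Assumption: plain min/max (asymmetric) quantization, chosen for
    illustration only; the project's real scheme may differ.
    """
    groups = []
    for start in range(0, len(weights), group_size):
        chunk = weights[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        # 3 bits give 8 levels, i.e. 7 steps between lo and hi.
        scale = (hi - lo) / 7 or 1.0  # fall back to 1.0 for constant groups
        codes = [max(0, min(7, round((w - lo) / scale))) for w in chunk]
        groups.append((scale, lo, codes))
    return groups


def dequantize_int3(groups):
    """Reconstruct approximate floats from (scale, offset, codes) groups."""
    return [lo + scale * c for scale, lo, codes in groups for c in codes]
```

Each reconstructed weight is within half a quantization step of the original. A real kernel would additionally pack the 3-bit codes into bytes and fuse dequantization into the matrix multiply, which this sketch leaves out.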
The author shares a successful optimization for running the Qwen3.5-35B-A3B-UD-Q4_K_L model on an RTX 4060 Ti 16GB using llama.cpp, achieving 40-60 tokens/s with 64k context. The post provides the detailed `models.ini` configuration and server start command to replicate this performance.
A transformer language model (TinyStories-260K) was successfully run locally on a stock Game Boy Color, utilizing INT8 weights and fixed-point math. This impressive technical feat involved a custom ROM and on-device tokenization, though performance is extremely slow and output is gibberish.
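To illustrate how INT8 weights and fixed-point math can replace floating point on hardware without an FPU, here is a small sketch of an integer-only dot product. The Q8.8 activation format, the Q0.16 scale format, and all function names are assumptions made for this example; the summary does not say which formats the ROM actually uses.

```python
FRAC_BITS = 8         # assumed Q8.8 format for activations and results
SCALE_FRAC_BITS = 16  # assumed Q0.16 format for the weight scale


def to_fixed(x, frac_bits=FRAC_BITS):
    """Convert a float to a fixed-point integer with frac_bits fraction bits."""
    return int(round(x * (1 << frac_bits)))


def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: w is approximated by q * scale."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def fixed_dot(q_weights, scale_fixed, activations_fixed):
    """Integer-only dot product; returns the result in Q8.8."""
    acc = 0
    for q, a in zip(q_weights, activations_fixed):
        acc += q * a  # int8 code times Q8.8 activation
    # Apply the Q0.16 weight scale once, then shift back down to Q8.8.
    return (acc * scale_fixed) >> SCALE_FRAC_BITS
```

The inner loop uses only integer multiplies and adds, and the scale is applied once per output rather than once per weight, which is the usual reason this layout suits small integer CPUs.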

OpenBMB presented the BitCPM-CANN 1.58-bit model. New models are being tested on the Huawei Ascend 910B.
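The "1.58-bit" label usually refers to ternary weights, since three possible values carry log2(3) ≈ 1.58 bits of information each. The sketch below shows one common way to ternarize weights, the absmean rule used by BitNet b1.58. Whether BitCPM-CANN uses this exact rule is an assumption, and the function name is hypothetical.

```python
import math


def ternarize(weights):
    """Quantize floats to {-1, 0, +1} using an absmean scale.

    Assumption: the absmean rule from BitNet b1.58, used here only as a
    representative ternary scheme.
    """
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


# Information content of one ternary weight, in bits.
BITS_PER_TERNARY_WEIGHT = math.log2(3)
```

With ternary codes, a matrix multiply reduces to additions and subtractions of activations followed by a single multiplication by the scale.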

A demonstration shows the Gemma 4 VLA model running on the Jetson Orin Nano Super.
This article reviews the evolution of deep learning on FPGAs, covering its historical development, current state, and future directions. It also highlights the importance of hardware acceleration for the advancement of artificial intelligence.
This survey paper examines various techniques and methods for accelerating Convolutional Neural Network (CNN) inference specifically on Field-Programmable Gate Arrays (FPGAs). It provides an overview of existing research and architectural approaches to improve the performance and efficiency of CNN deployments on hardware.