Language modeling

5 items

RESEARCH·↑ trending·Reddit r/MachineLearning·4/13/2026

I scaled a pure Spiking Neural Network (SNN) to 1.088B parameters from scratch. Ran out of budget, but here is what I found [R]

An 18-year-old indie developer trained a pure Spiking Neural Network (SNN) with 1.088 billion parameters from scratch for language modeling and reports that the loss converged, despite the common belief that vanishing gradients would prevent it. Key findings include sustained 93% sparsity and the unexpected emergence of structurally correct Russian text. The experiment was halted when the budget ran out.

Spiking Neural Networks · AI scaling · large language models · Language modeling

RESEARCH·arXiv CS.LG·5d ago

Do Transformers Need Three Projections? Systematic Study of QKV Variants

This research systematically evaluates variants of the Query, Key, and Value (QKV) attention formulation in Transformers, including shared key-value, shared query-key, and single-projection variants. Experiments across synthetic, vision, and language modeling tasks show that these alternatives perform on par with, and occasionally better than, standard QKV Transformers, with Q-K=V sharing offering a significant KV cache reduction in language modeling.

QKV · computer vision · attention mechanisms · Language modeling

RESEARCH·arXiv CS.LG·4/9/2026

Probabilistic Language Tries: A Unified Framework for Compression, Decision Policies, and Execution Reuse

This paper introduces Probabilistic Language Tries (PLTs), a unified representation that makes explicit the prefix structure of any generative model over sequences. PLTs serve as an optimal lossless compressor, a policy representation for sequential decision problems (such as games and robotics), and a memoization index for execution reuse, with a key theorem on prior-guided caching.

sequence generation · reinforcement learning · data compression · Probabilistic Models

RESEARCH·arXiv CS.AI·4/17/2026

Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality

This paper investigates whether routing topology determines language modeling quality in Mixture-of-Experts (MoE) architectures. The authors find that different routing variants, including a novel cosine-similarity-based one, reach statistically equivalent asymptotic perplexity, suggesting that routing design has a smaller impact on final quality than previously thought.

neural networks · routing algorithms · Mixture of Experts · Language modeling

RESEARCH·arXiv CS.AI·11d ago

The Cognitive Categorical Transformer: Category-Theoretic Inductive Biases for Language Modeling

The Cognitive Categorical Transformer (CCT) is a 306M-parameter architecture that augments a pretrained GPT-2 Small backbone with components inspired by category theory and cognitive science. It achieves a 12% relative reduction in perplexity on WikiText-103 compared to a fine-tuned GPT-2 Small baseline, with 84% of the architectural improvement attributed to GT-Full simplicial message passing.

Transformer Architecture · cognitive science · GPT-2 · Category Theory