Model Architecture

13 items

RESEARCH · arXiv CS.LG · 1d ago

WAV: Multi-Resolution Block Residual Routing for Deep Decoder-Only Transformers

The paper introduces WAV v1, a lightweight multi-resolution residual routing method for decoder-only Transformers. It extends standard residual connections by augmenting each block with directional detail bases that contrast attention updates against MLP updates and early-sublayer against late-sublayer dynamics.

Residual Connections neural networks deep learning Model Architecture

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/22/2026

Forgive my ignorance but how is a 27B model better than 397B?

A user expresses confusion regarding how a 27B dense model could outperform a 397B Mixture-of-Experts (MoE) model, specifically mentioning Qwen, and questions the utility of the additional experts.

AI models Model Architecture MoE Qwen

NEWS · ↑ trending · Reddit r/LocalLLaMA · 5/7/2026

feat: Add Mimo v2.5 model support by AesSedai · Pull Request #22493 · ggml-org/llama.cpp

The pull request adds MiMo v2.5 model support to llama.cpp and describes the model's architecture. MiMo v2.5 is a sparse MoE model with 310B total and 15B activated parameters. It supports text, image, video, and audio modalities and a long context length.

multimodal AI Model Architecture llama.cpp MoE
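
The total and activated parameter counts in the summary above imply how sparse the model is per token. A minimal arithmetic sketch, using only the 310B and 15B figures quoted there; the helper name `active_fraction` is hypothetical and not part of llama.cpp or the model's code:

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of a model's parameters that are used for each token."""
    if total_params_b <= 0 or not 0 <= active_params_b <= total_params_b:
        raise ValueError("expected total > 0 and 0 <= active <= total")
    return active_params_b / total_params_b


# Figures taken from the summary: 310B total, 15B activated.
mimo_fraction = active_fraction(310, 15)
print(f"MiMo v2.5 activates about {mimo_fraction:.1%} of its parameters per token")

# A dense model activates every parameter, so its fraction is 1.0.
dense_fraction = active_fraction(27, 27)
```

Under this reading, a sparse MoE model stores far more parameters than it computes with on any single token, which is why total parameter count alone is a poor proxy for per-token compute.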

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/26/2026

Can Geometric Deep Learning lead eliminate the need of "Brute Force" pre-training [D]

The author questions whether Geometric Deep Learning, by explicitly building symmetries and invariances into its architecture, could significantly reduce or eliminate the need for extensive, data-intensive pre-training. This raises the question of whether current massive-scale pre-training is largely a consequence of architectures lacking inherent invariance.

pre-training Symmetry Model Architecture Geometric Deep Learning

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/11/2026

If Dense Models are better for Coding, why are Qwen-Coders MoE?

The author questions Qwen's decision to use the Mixture-of-Experts (MoE) architecture for its coding models instead of dense models, which the author considers more accurate. The author speculates that the choice is related to inference speed and regrets the absence of a 14B successor.

Model Architecture coding AI MoE AI

RESEARCH · arXiv CS.LG · 4/23/2026

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

Expert Upcycling proposes a method to progressively expand Mixture-of-Experts (MoE) capacity in large language models during continued pre-training. It increases the number of experts via duplication and router extension to provide a warm initialization, aiming to reduce training costs while preserving per-token inference cost.

Model Architecture training-optimization large language models
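
The duplication and router-extension step described in the Expert Upcycling summary above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's implementation: experts are plain linear maps stored as nested lists, routing is top-1 with no gate scaling, and every function name here (`route_top1`, `apply_expert`, `moe_forward`, `upcycle`) is hypothetical.

```python
import copy


def route_top1(router, x):
    """Return the index of the expert with the highest router logit for token x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    return max(range(len(logits)), key=logits.__getitem__)


def apply_expert(expert, x):
    """Apply a toy linear expert (a weight matrix stored as a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in expert]


def moe_forward(experts, router, x):
    """Top-1 MoE layer: only the selected expert runs for this token."""
    return apply_expert(experts[route_top1(router, x)], x)


def upcycle(experts, router, factor=2):
    """Grow the expert count by `factor`: copy each expert and its router row."""
    new_experts = [copy.deepcopy(e) for e in experts for _ in range(factor)]
    new_router = [list(row) for row in router for _ in range(factor)]
    return new_experts, new_router


# Two toy experts (assumed values) and a router with one row per expert.
experts = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
router = [[1.0, -1.0], [-1.0, 1.0]]
x = [0.5, 2.0]

big_experts, big_router = upcycle(experts, router)
before = moe_forward(experts, router, x)
after = moe_forward(big_experts, big_router, x)
```

In this toy setting the layer's output is unchanged immediately after expansion, because each copy starts with the same weights and router row as its source. That is one way to read "warm initialization". Each token still runs a single expert, so per-token compute stays the same while the number of experts doubles.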

RESEARCH · DEV.to AI · 4/23/2026

qwen3.6-27b scores 77.2% on SWE-bench. the dense model is winning against MoE.

The Qwen3.6-27B dense model outperformed the Qwen3.6-35B-A3B MoE model on SWE-bench, scoring 77.2% versus 73.4%, a gap of 3.8 percentage points. This suggests that dense models may be more effective for real-world software engineering tasks.

AI models Model Architecture Benchmarks MoE

ARTICLE · DEV.to AI · 4/26/2026

DeepSeek V4: Million-Token Context That Actually Works

DeepSeek V4 offers a 1 million-token context and addresses the GPU memory cost of long sequences with a hybrid attention architecture that compresses the KV cache by nearly 9x. The article argues that this makes long-context inference practical, which it says is not the case for many other models.

DeepSeek AI models Model Architecture large language models
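
To see why KV cache size dominates at the million-token scale mentioned in the DeepSeek V4 summary above, the standard sizing formula for an uncompressed cache can be worked through. This is a rough sketch: the layer count, KV head count, head dimension, and 2-byte values below are hypothetical placeholders, not DeepSeek V4's actual configuration, and the divisor of 9 only approximates the "nearly 9x" figure from the summary.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Uncompressed KV cache size for one sequence.

    The leading factor of 2 counts keys and values, which are stored
    for every layer, KV head, and token position.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, chosen only to illustrate the scale.
baseline = kv_cache_bytes(layers=60, kv_heads=8, head_dim=128, seq_len=1_000_000)

# Approximate effect of the compression ratio quoted in the summary.
compressed = baseline / 9

print(f"baseline: {baseline / 1e9:.1f} GB, compressed: {compressed / 1e9:.1f} GB")
```

With these assumed dimensions, a single million-token sequence needs a cache of hundreds of gigabytes before compression, so a reduction of roughly 9x can decide whether the sequence fits on the available GPUs.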

RESEARCH · arXiv CS.CL · 5/1/2026

Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling

This paper introduces the Length Value Model (LenVM), a novel token-level framework for modeling the remaining generation length in autoregressive models. By formulating length modeling as a value estimation problem, LenVM provides an annotation-free, scalable, and effective signal for LLMs and VLMs, improving performance on exact length matching tasks.

deep learning Model Architecture computer vision large language models

RESEARCH · arXiv CS.CL · 27d ago

The Bicameral Model: Bidirectional Hidden-State Coupling Between Parallel Language Models

The Bicameral Model couples two frozen, pretrained language models via a trainable neural interface on their intermediate hidden states, allowing them to operate in lockstep. This method enables a primary model to drive a task while an auxiliary model uses tools or solves constraints, significantly improving accuracy on tasks like arithmetic and logic puzzles.

neural networks language models AI models Model Architecture

RESEARCH · AI at Meta (YouTube) · 12/8/2025

SAM 3: Building a unified model architecture for detection and tracking

SAM 3 focuses on building a unified model architecture for detection and tracking tasks. It aims to improve efficiency and accuracy in computer vision applications.

Model Architecture object detection machine learning computer vision

ARTICLE · AI at Meta (YouTube) · 11/20/2025

SAM 3D: Behind the two-model design | AI at Meta

This article delves into the two-model design powering SAM 3D, an AI initiative from Meta. It explains the architectural choices and engineering rationale behind this AI system.

AI models SAM 3D Model Architecture Meta AI

NEWS · DEV.to AI · 17d ago

Topology rewrite not bug repair

The topology rewrite for an AI system or model is a fundamental reformulation, not just a bug fix. Further details on this development will be shared as the build matures.

topology Model Architecture Software Engineering bug fix