Vision-Language Models

25 items

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/9/2026

Detecting mirrored selfie images: OCR the best way? [D]

The poster is looking for an effective way to detect mirrored text in selfies before passing the images to Vision-Language Models (VLMs) or face-embedding extractors, which are insensitive to mirroring because they were trained on augmented data. Their idea is to use OCR (EasyOCR) to compare the text-reading score on the original image against the score on its mirrored copy, and they ask whether this is the best approach or whether a smaller, smarter model would do the job.

40
RESEARCH · arXiv CS.CL · 4/24/2026

Beyond Pixels: Introspective and Interactive Grounding for Visualization Agents

Vision-Language Models (VLMs) often misinterpret interactive charts due to a "Pixel-Only Bottleneck," treating them as static images. This paper introduces Introspective and Interactive Visual Grounding (IVG), a framework combining spec-grounded introspection and view-grounded interaction to resolve visual ambiguities, significantly improving QA accuracy.

30
RESEARCH · arXiv CS.AI · 27d ago

Don't Look at the Numbers: Visual Anchoring Bias and Layer-wise Representation in VLMs

This paper demonstrates that numeric anchors embedded in images systematically bias the quality judgments of multiple Vision-Language Models (VLMs). Layer-wise probing reveals that the optimal layers for quality prediction lie deeper than the layers where anchor classification saturates, establishing a causal account of visual anchoring bias.

29
ARTICLE · DEV.to AI · 28d ago

Fine-tuning CLIP on a Niche Domain: How I Got +26pp Accuracy on Architectural Styles and What You Can Apply to Your Own Domain

This article details the process of fine-tuning OpenCLIP ViT-B/32 for architectural styles, achieving a +26 percentage point increase in accuracy. The author focuses on the critical decisions made before and after the training loop that were responsible for this significant result, rather than the training loop optimization itself.

27
RESEARCH · arXiv CS.CL · 4/10/2026

Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs

This paper proposes DLR, a reinforced latent reasoning framework for Vision-Language Models (VLMs) that improves complex visual reasoning by overcoming the information loss of textual CoT. It dynamically decomposes queries, extracts visual latents, and deduces answers, offering greater interpretability and outperforming baselines on vision-centric benchmarks.

27
RESEARCH · arXiv CS.CL · 4/8/2026

Document Optimization for Black-Box Retrieval via Reinforcement Learning

This research paper proposes a new approach to document optimization: it transforms documents to align better with retrieval systems using Reinforcement Learning (GRPO), with ranking improvements as the reward. The method, which applies to black-box retrievers, showed gains on code retrieval and visual document retrieval tasks.

27
RESEARCH · arXiv CS.AI · 28d ago

Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits

This research tests the "Attention-Confidence Assumption" in Vision-Language Models (VLMs), finding that attention structure is a near-zero predictor of correctness. The study uses a unified mechanistic pipeline (VLM Reliability Probe) to analyze attention, generation dynamics, and hidden-state geometry in three VLM families.

27