Vision-Language Models

25 items

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/20/2026

SGOCR: A Spatially-Grounded OCR-focused Pipeline & V1 Dataset [P]

An independent researcher created SGOCR, an open-source dataset pipeline for spatially-grounded, OCR-focused VQA, to fill a gap in visual datasets for text grounding in imagery. This pipeline generates VQA tuples with rich metadata, supporting diverse VLM training strategies.

Open Source Vision-Language Models datasets OCR

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/9/2026

Detecting mirrored selfie images: OCR the best way? [D]

The poster is looking for an effective way to detect mirrored text in selfies before passing them to Vision-Language Models (VLMs) or face-embedding extractors, which are insensitive to mirroring because they were trained with augmented data. Their idea is to use OCR (EasyOCR) to compare the text-reading score of the normal image against the mirrored one, and they ask whether this is the best approach or whether a smaller, smarter model-based solution exists.

AI models Image processing Vision-Language Models computer vision
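
The comparison described in the post above (score the image as-is, score it flipped, keep whichever reads better) can be sketched roughly as follows. This is a minimal illustration, not the poster's code: the image is a plain list of pixel rows, `ocr_score` is a hypothetical caller-supplied scoring function, and `margin` is an assumed tolerance for near-ties.

```python
from typing import Callable, List, Sequence

Image = List[List[int]]  # hypothetical stand-in: rows of pixel values


def flip_horizontal(image: Sequence[Sequence[int]]) -> Image:
    """Mirror an image left-to-right by reversing every row."""
    return [list(reversed(row)) for row in image]


def looks_mirrored(
    image: Sequence[Sequence[int]],
    ocr_score: Callable[[Sequence[Sequence[int]]], float],
    margin: float = 0.1,
) -> bool:
    """Return True when the flipped image reads clearly better than the original.

    ocr_score is a hypothetical function returning a readability score,
    for example the mean OCR confidence over detected text boxes.
    margin is an assumed tolerance so that near-ties count as "not mirrored".
    """
    original = ocr_score(image)
    flipped = ocr_score(flip_horizontal(image))
    return flipped - original > margin
```

With EasyOCR, one plausible `ocr_score` would average the confidence values that `Reader.readtext` returns for each detection. Images with no text score the same either way and fall through as "not mirrored", so this check cannot decide those cases.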

RESEARCH · arXiv CS.CL · 4/24/2026

Beyond Pixels: Introspective and Interactive Grounding for Visualization Agents

Vision-Language Models (VLMs) often misinterpret interactive charts due to a "Pixel-Only Bottleneck," treating them as static images. This paper introduces Introspective and Interactive Visual Grounding (IVG), a framework combining spec-grounded introspection and view-grounded interaction to resolve visual ambiguities, significantly improving QA accuracy.

AI accuracy Vision-Language Models Visual Grounding Benchmarking

RESEARCH · arXiv CS.AI · 27d ago

Don't Look at the Numbers: Visual Anchoring Bias and Layer-wise Representation in VLMs

This research paper demonstrates that embedded numeric anchors on images systematically bias Vision-Language Model quality judgments across multiple VLMs. Layer-wise probing reveals that optimal layers for quality prediction are deeper than where anchor classification saturates, establishing a causal account of visual anchoring bias.

neural networks Vision-Language Models Model Evaluation representation learning

CASE · AWS Machine Learning Blog · 5/6/2026

Cost effective deployment of vision-language models for pet behavior detection on AWS Inferentia2

Pet-tech startup Tomofun is leveraging EC2 Inf2 instances powered by AWS Inferentia2 for cost-effective deployment of vision-language models for pet behavior detection. This strategy allows the company to significantly reduce costs while maintaining the accuracy of its systems.

Vision-Language Models AWS Inferentia2 pet tech AI deployment

RESEARCH · DEV.to AI · 4/19/2026

F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

F-VLM introduces a novel approach for open-vocabulary object detection by efficiently leveraging frozen pre-trained vision and language models. This method allows for identifying a wide range of objects without requiring specific training data for each new category.

Vision-Language Models deep learning object detection computer vision
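
As a rough illustration of how open-vocabulary classification can avoid per-category training, the sketch below scores a region embedding against text embeddings of category names by cosine similarity. It is a generic sketch under stated assumptions, not F-VLM's actual pipeline: the embeddings are hypothetical plain vectors standing in for the outputs of frozen encoders, and `temperature` is an assumed value.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_region(
    region_emb: Sequence[float],
    text_embs: Dict[str, Sequence[float]],
    temperature: float = 0.01,
) -> Dict[str, float]:
    """Softmax over temperature-scaled cosine similarities to each category name."""
    sims = {name: cosine(region_emb, emb) / temperature for name, emb in text_embs.items()}
    peak = max(sims.values())  # subtract the max for numerical stability
    exps = {name: math.exp(s - peak) for name, s in sims.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}
```

Adding a new category here only means adding another entry to `text_embs`; nothing is retrained.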

ARTICLE · DEV.to AI · 28d ago

Multimodal AI Applications in 2026

This article discusses the evolution of multimodal AI models, which are transitioning from research to production APIs by 2026, integrating text, images, audio, and video. It covers current capabilities, architectures, and production patterns for these applications, featuring models like GPT-4o and Claude.

AI applications AI models multimodal AI Vision-Language Models

RESEARCH · DEV.to AI · 20d ago

PaliGemma 2: A Family of Versatile VLMs for Transfer

PaliGemma 2 is introduced as a new family of versatile Vision-Language Models (VLMs) specifically designed to excel in various transfer learning applications. This advancement aims to improve performance across diverse multimodal tasks through effective knowledge transfer.

AI models Vision-Language Models VLMs Transfer Learning

RESEARCH · arXiv CS.AI · 4/17/2026

Seeing Through Experts Eyes A Foundational Vision Language Model Trained on Radiologists Gaze and Reasoning

GazeX is a new vision language model trained on radiologists' eye-tracking data and reasoning to improve chest X-ray interpretation. The model learns to emulate expert spatial and temporal attention, aiming to bridge the gap between model outputs and clinical diagnostic reasoning.

Vision-Language Models computer vision medical AI diagnostic tools

RESEARCH · DEV.to AI · 24d ago

MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

MobileVLM V2 introduces a new and enhanced baseline for vision language models, focusing on faster performance and stronger capabilities. This research aims to advance the efficiency and robustness of VLMs on mobile platforms.

AI models Vision-Language Models research deep learning

RESEARCH · DEV.to AI · 29d ago

Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model

The paper introduces 'Spatial Forcing,' a method for aligning implicit spatial representations in vision-language-action models. It aims to enhance these models' understanding and interaction with spatial information.

AI models Vision-Language Models machine learning robotics

ARTICLE · DEV.to AI · 28d ago

Fine-tuning CLIP on a Niche Domain: How I Got +26pp Accuracy on Architectural Styles and What You Can Apply to Your Own Domain

This article details the process of fine-tuning OpenCLIP ViT-B/32 for architectural styles, achieving a +26 percentage point increase in accuracy. The author focuses on the critical decisions made before and after the training loop that were responsible for this significant result, rather than the training loop optimization itself.

CLIP Vision-Language Models machine learning computer vision

RESEARCH · arXiv CS.LG · 5/5/2026

GAZE: Grounded Agentic Zero-shot Evaluation with Viewer-Level Tools and Literature Retrieval on Rare Brain MRI

GAZE is a framework enabling medical Vision-Language Models (VLMs) to iteratively analyze brain MRI images using viewer-level tools and literature retrieval. It achieved 58.2 mAP for lesion localization and 34.9% Top-1 diagnostic accuracy on the NOVA benchmark for rare neurological conditions.

Vision-Language Models neurology Benchmarking medical AI

RESEARCH · arXiv CS.CL · 4/10/2026

Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs

This paper proposes DLR, a reinforced latent reasoning framework for Vision-Language Models (VLMs) that improves complex visual reasoning by overcoming the information loss of textual CoT. It dynamically decomposes queries, extracts visual latents, and deduces answers, offering greater interpretability and outperforming baselines on vision-centric benchmarks.

Vision-Language Models visual reasoning Reinforced Latent Reasoning Chain-of-Thought

RESEARCH · arXiv CS.CL · 4/27/2026

Source-Modality Monitoring in Vision-Language Models

This research defines and investigates source-modality monitoring in Vision-Language Models (VLMs), examining their ability to track the origin of information. It evaluates how VLMs use syntactic and semantic signals to bind input sources, finding both are crucial but semantic signals often dominate, with implications for model robustness.

model robustness multimodal AI Vision-Language Models

RESEARCH · arXiv CS.CL · 4/27/2026

Incentivizing Neuro-symbolic Language-based Reasoning in VLMs via Reinforcement Learning

This work explores neuro-symbolic language reasoning in VLMs, leveraging Reinforcement Learning to improve analytical abilities and efficiency. It achieved a 3.33% accuracy increase on a vision-language evaluation dataset while reducing reasoning tokens by 75%.

Vision-Language Models reinforcement learning Reasoning Neuro-symbolic AI

RESEARCH · arXiv CS.CL · 4/8/2026

Document Optimization for Black-Box Retrieval via Reinforcement Learning

This research paper proposes a new approach to document optimization that transforms documents to align better with retrieval systems via Reinforcement Learning (GRPO), using ranking improvements as the reward. The method, which applies to black-box retrievers, showed gains on code retrieval and visual document retrieval tasks.

language models Vision-Language Models reinforcement learning document optimization

RESEARCH · arXiv CS.LG · 7d ago

From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models

Researchers propose Demo2Reward, a test-time adaptation technique to optimize Vision-Language Model (VLM) reward models in robotics. It uses a few demonstrations to reduce false positives while preserving true positives, without requiring additional model training.

Vision-Language Models reinforcement learning Prompt Optimization robotics

RESEARCH · arXiv CS.AI · 28d ago

Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits

This research tests the "Attention-Confidence Assumption" in Vision-Language Models (VLMs), finding that attention structure is a near-zero predictor of correctness. The study uses a unified mechanistic pipeline (VLM Reliability Probe) to analyze attention, generation dynamics, and hidden-state geometry in three VLM families.

Vision-Language Models Mechanistic Interpretability attention mechanisms AI reliability

RESEARCH · arXiv CS.LG · 29d ago

Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding

A new training-free inference framework, Positive-and-Negative Decoding (PND), is introduced to address object hallucination in Vision-Language Models (VLMs). PND enforces visual fidelity by using a dual-path contrast mechanism, leading to state-of-the-art performance without retraining.

multimodal AI hallucination Vision-Language Models decoding
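
The summary above does not give PND's formulation, so the sketch below shows only the generic idea behind training-free contrastive decoding: combine next-token logits from a "positive" path and a "negative" path so that tokens the negative path also favors are demoted. The rule `(1 + alpha) * positive - alpha * negative` is a common contrastive-decoding form used here as an assumption, and all names are hypothetical.

```python
from typing import List, Sequence


def contrast_logits(
    positive: Sequence[float],
    negative: Sequence[float],
    alpha: float = 1.0,
) -> List[float]:
    """Boost tokens favored by the positive path and penalize those the negative path favors.

    alpha is an assumed contrast strength; alpha = 0 returns the positive logits unchanged.
    """
    if len(positive) != len(negative):
        raise ValueError("logit vectors must have the same length")
    return [(1 + alpha) * p - alpha * n for p, n in zip(positive, negative)]


def greedy_token(logits: Sequence[float]) -> int:
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])
```

For example, with positive logits `[2.0, 2.5, 0.0]` and negative logits `[0.0, 2.5, 0.0]`, greedy decoding on the positive path alone picks token 1, while the contrasted logits pick token 0, the token whose support comes only from the positive path.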