
Natural Language Processing

168 items

RESEARCH · arXiv CS.CL · 1d ago

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

The HKJudge project introduces the first sentence-level, expert-annotated legal discourse corpus of Hong Kong criminal judgments, comprising approximately 290k sentences. It uses a two-tier discourse schema to identify what courts find, how they reason, and what they rule, and reports high inter-annotator agreement.

36
RESEARCH · arXiv CS.CL · 4/22/2026

Model-Agnostic Meta Learning for Class Imbalance Adaptation

This paper introduces Hardness-Aware Meta-Resample (HAMR), a unified framework that adaptively addresses class imbalance and data difficulty in NLP tasks. HAMR employs bi-level optimization and a neighborhood-aware resampling mechanism to prioritize genuinely challenging samples and minority classes, and shows substantial improvements on diverse imbalanced datasets.

35
DOC · DEV.to AI · 4/16/2026

LLM vs RAG

This post compares LLMs (Large Language Models) and RAG (Retrieval-Augmented Generation), outlining their core differences in type, knowledge source, accuracy, and use cases. It explains that RAG improves an LLM's factual grounding by integrating external, real-time data, thereby mitigating hallucinations.

31
RESEARCH · arXiv CS.CL · 4/16/2026

A Multi-Model Approach to English-Bangla Sentiment Classification of Government Mobile Banking App Reviews

This study classifies sentiment in English and Bangla reviews of Bangladeshi government mobile banking apps, using a hybrid labeling approach for 5,652 reviews. It found that traditional machine learning models like Random Forest and Linear SVM significantly outperformed fine-tuned XLM-RoBERTa for this specific task.

31
RESEARCH · arXiv CS.CL · 4d ago

Predict and Reconstruct: Joint Objectives for Self-Supervised Language Representation Learning

This paper introduces a hybrid pre-training objective for text encoders, combining a JEPA-style latent-space prediction loss with a standard Masked Language Modelling (MLM) objective. The approach aims to anchor representations to deeper semantic structure rather than surface-form token identity, and the resulting embeddings are reported to be significantly more uniform.

30
RESEARCH · DEV.to AI · 4/13/2026

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

This post explores an approach to improving Reinforcement Learning for Large Language Model (LLM) reasoning by focusing on "high-entropy minority tokens". It proposes that these less frequent yet highly informative tokens are the key drivers of effective learning, challenging the conventional 80/20 rule.

29
ARTICLE · DEV.to AI · 27d ago

Everything Google announced at its Android Show, from Googlebooks to vibe-coded widgets

The article gives a technical analysis of Google's Android Show announcements, focusing on the new Google Books app and vibe-coded widgets. It details how Google Books uses a proprietary rendering engine with ML for text recognition, while vibe-coded widgets use NLP and computer vision via TensorFlow Lite for personalized experiences.

29