
natural language processing

167 items

RESEARCH · arXiv CS.CL · 19h ago

Community-Specific Slang and Entity Detection via Semantic Shift in Fine-Tuned Language Models

This research proposes an unsupervised method to identify community-specific slang and unique entities by analyzing the magnitude of semantic shift. Semantic shift is defined as the evolution of a word's encoded representation after fine-tuning a pre-trained Large Language Model (LLM) on a community-specific text corpus.

54
RESEARCH · arXiv CS.CL · 19h ago

Retrieval Augmented Generation Framework for the Nepali Legal Domain Question Answering

This study presents the first application of a Retrieval Augmented Generation (RAG) model for Nepali legal question answering, addressing data scarcity in low-resource languages. Using BM25 on chunked documents, the RAG pipeline achieved high precision and truthfulness, demonstrating its effectiveness in the Nepali legal domain.

54
ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/18/2026

easyaligner: Forced alignment with GPU acceleration and flexible text normalization (compatible with all w2v2 models on HF Hub) [P]

easyaligner is a new, performant forced alignment library offering GPU acceleration and flexible text normalization, compatible with all w2v2 models on Hugging Face Hub. It addresses common challenges in speech-to-text preprocessing, such as handling partial transcripts, irrelevant audio, and long segments without chunking.

46
RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/24/2026

New project about llm hallucination [P]

The post introduces a side project, with a GitHub repository, that aims to mitigate LLM hallucination through contrastive sampling and selective training. The core idea treats hallucination as a preference problem, using self-generated negative samples and divergence-based gated learning to reinforce correct answers and suppress wrong ones.

45
RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/10/2026

National University of Singapore Presents "DMax": A New Paradigm For Diffusion Language Models (dLLMs) Enabling Aggressive Parallel Decoding.

DMax is a new paradigm for efficient diffusion language models (dLLMs) that mitigates error accumulation in parallel decoding. It enables aggressive parallelism by reformulating decoding as a progressive self-refinement process and introducing a unified training strategy.

44
ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 19d ago

Qwen3.6 35Ba3 has changed my workflows and even how I use my computer

The author describes how the Qwen3.6 35Ba3 model has reshaped their development workflows and computer usage, letting them automate complex tasks and interact with the operating system in natural language. They now delegate tasks such as devops, content creation, and code testing to the model.

42
RESEARCH · arXiv CS.CL · 1d ago

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

The HKJudge project introduces the first sentence-level, expert-annotated legal discourse corpus of Hong Kong criminal judgments, comprising approximately 290k sentences. It utilizes a two-tier discourse schema to identify what courts find, how they reason, and what they rule, with high inter-annotator agreement.

40
RESEARCH · arXiv CS.CL · 4/21/2026

Foundational Study on Authorship Attribution of Japanese Web Reviews for Actor Analysis

This foundational study explores authorship attribution using stylistic features to support actor analysis in threat intelligence, testing methods on Japanese web reviews. While BERT fine-tuning performed best overall, TF-IDF with logistic regression showed superior stability and accuracy when scaling to hundreds of authors.

36