20 results
arxiv.org · 2022 · arXiv · PDF
A global analysis of metrics used for measuring performance in natural language processing
Kathrin Blagec; Georg Dorffner; Milad Moradi; Simon Ott; Matthias Samwald

Measuring the performance of natural language processing models is challenging. Traditionally used metrics, such as BLEU and ROUGE, originally devised for machine translation and summarization, have been shown to suffer from low correlation with human judgment and a lack of transferability to other tasks and languages.…

cs.CL cs.AI
arxiv.org · 2023 · arXiv · PDF
Can Large Language Models design a Robot?
Francesco Stella; Cosimo Della Santina; Josie Hughes

Large Language Models can lead researchers in the design of robots.…

cs.RO
semanticscholar.org · 2025 · Nature · 5,401 citations
DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning
DeepSeek-AI; Daya Guo; Dejian Yang; Haowei Zhang; Jun-Mei Song; Ruoyu Zhang; R. Xu; Qihao Zhu; Shirong Ma; Peiyi Wang; Xiaoling Bi; Xiaokang Zhang; Xingkai Yu; Yu Wu; Z. F. Wu; Zhibin Gou; Zhihong Shao; Zhuoshu Li; Ziyi Gao; A. Liu; Bing Xue; Bing-Li Wang; Bochao Wu; B. Feng; Chengda Lu; Chenggang Zhao; C. Deng; Chenyu Zhang; C. Ruan; Damai Dai; Deli Chen; Dong-Li Ji; Erhang Li; Fangyun Lin; Fucong Dai; Fuli Luo; Guangbo Hao; Guanting Chen; Guowei Li; H. Zhang; Han Bao; Hanwei Xu; Haocheng Wang; Honghui Din

General reasoning represents a long-standing and formidable challenge in artificial intelligence (AI). Recent breakthroughs, exemplified by large language models (LLMs) [1,2] and chain-of-thought (CoT) prompting [3], have achieved considerable success on foundational reasoning tasks. However, this success is heavily continge…

DOI: 10.1038/s41586-025-09422-z
arxiv.org · 2025 · arXiv · PDF
Liars' Bench: Evaluating Lie Detectors for Language Models
Kieron Kretschmar; Walter Laurito; Sharan Maiya; Samuel Marks

Prior work has introduced techniques for detecting when large language models (LLMs) lie, that is, generate statements they believe are false. However, these techniques are typically validated in narrow settings that do not capture the diverse lies LLMs can generate. We introduce LIARS' BENCH, a testbed consisting of 7…

cs.CL cs.AI
arxiv.org · 2026 · arXiv · PDF
Spark-LLM-Eval: A Distributed Framework for Statistically Rigorous Large Language Model Evaluation
Subhadip Mitra

Evaluating large language models at scale remains a practical bottleneck for many organizations. While existing evaluation frameworks work well for thousands of examples, they struggle when datasets grow to hundreds of thousands or millions of samples. This scale is common when assessing model behavior across diverse d…

cs.DC cs.CL cs.LG
arxiv.org · 2025 · arXiv · PDF
Sentra-Guard: A Real-Time Multilingual Defense Against Adversarial LLM Prompts
Md. Mehedi Hasan; Sk Tanzir Mehedi; Ziaur Rahman; Rafid Mostafiz; Md. Abir Hossain

This paper presents a real-time modular defense system named Sentra-Guard. The system detects and mitigates jailbreak and prompt injection attacks targeting large language models (LLMs). The framework uses a hybrid architecture with FAISS-indexed SBERT embedding representations that capture the semantic meaning of prom…

cs.CR cs.AI
arxiv.org · 2019 · arXiv · PDF
A Benchmark Study of Machine Learning Models for Online Fake News Detection
Junaed Younus Khan; Md. Tawkat Islam Khondaker; Sadia Afroz; Gias Uddin; Anindya Iqbal

The proliferation of fake news and its propagation on social media has become a major concern due to its ability to create devastating impacts. Different machine learning approaches have been suggested to detect fake news. However, most of those focused on a specific type of news (such as political) which leads us to t…

cs.CL cs.IR cs.LG stat.ML
DOI: 10.1016/j.mlwa.2021.100032
arxiv.org · 2025 · arXiv · PDF
Noise-Driven Persona Formation in Reflexive Neural Language Generation
Toshiyuki Shigemura

This paper introduces the Luca-Noise Reflex Protocol (LN-RP), a computational framework for analyzing noise-driven persona emergence in large language models. By injecting stochastic noise seeds into the initial generation state, we observe nonlinear transitions in linguistic behavior across 152 generation cycles. Our …

cs.CL
arxiv.org · 2024 · arXiv · PDF
SSFF: Investigating LLM Predictive Capabilities for Startup Success through a Multi-Agent Framework with Enhanced Explainability and Performance
Xisen Wang; Yigit Ihlamur; Fuat Alican

LLM based agents have recently demonstrated strong potential in automating complex tasks, yet accurately predicting startup success remains an open challenge with few benchmarks and tailored frameworks. To address these limitations, we propose the Startup Success Forecasting Framework, an autonomous system that emulate…

cs.AI
arxiv.org · 2025 · arXiv · PDF
Tell Me: An LLM-powered Mental Well-being Assistant with RAG, Synthetic Dialogue Generation, and Agentic Planning
Trishala Jayesh Ahalpara

We present Tell Me, a mental well-being system that leverages advances in large language models to provide accessible, context-aware support for users and researchers. The system integrates three components: (i) a retrieval-augmented generation (RAG) assistant for personalized, knowledge-grounded dialogue; (ii) a synth…

cs.CL cs.AI cs.HC cs.LG