Entity Dopravní podnik Ostrava

Dopravní podnik Ostrava

PulseAugur coverage of Dopravní podnik Ostrava — every cluster mentioning Dopravní podnik Ostrava across labs, papers, and developer communities, ranked by signal.

Total · 30 days

16 in 90 days

Releases · 30 days

0 in 90 days

Papers · 30 days

16 in 90 days

Tier distribution · 90 days

Relations

instance of Direct Preference Optimization: Your Language Model is Secretly a Reward Model 60%

Sentiment · 30 days

4 days with sentiment data

Recent · Page 1 of 1 · 16 items

TOOL · CL_35086 · May 17 · 00:01

LLM Fine-Tuning Explained: SFT, RAG, and Data Preparation

This blog post explains the process and necessity of fine-tuning large language models (LLMs) for specific tasks. It differentiates fine-tuning from Retrieval-Augmented Generation (RAG), stating that fine-tuning is best…
TOOL · CL_34321 · May 16 · 09:37

LLM alignment: PPO, DPO, or verifier-based RL for 2026?

This article provides a technical guide for selecting the appropriate reinforcement learning technique for aligning large language models in 2026. It contrasts Proximal Policy Optimization (PPO) for Reinforcement Learni…
TOOL · CL_29384 · May 12 · 15:44

New TBPO method optimizes language models at token level

Researchers have introduced Token-level Bregman Preference Optimization (TBPO), a new method for aligning language models using pairwise preferences. Unlike existing approaches that focus on full sequences, TBPO operate…
TOOL · CL_27578 · May 10 · 21:50

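EvoPref algorithm enhances language model alignment through evolutionary optimization

Researchers have developed EvoPref, a novel multi-objective evolutionary algorithm designed to improve the alignment of large language models (LLMs). Unlike traditional gradient-based methods, which can lead to preference collapse and narrow behavioral patterns, EvoPref maintains a diverse population of adapters optimized for helpfulness, harmlessness, and honesty. This approach significantly improves preference coverage and lowers collapse rates while achieving competitive alignment quality, establishing evolutionary optimization as a viable paradigm for diverse LLM alignment.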
RESEARCH · CL_23484 · May 8 · 19:28

DPO vs SimPO: Removing Reference Model Alters Preference Tuning

A recent article explores the differences between Direct Preference Optimization (DPO) and Simplified Preference Optimization (SimPO) in the context of fine-tuning large language models. It highlights how SimPO's remova…
TOOL · CL_21435 · May 7 · 20:51

DPO vs SimPO: Preference tuning methods compared for LLM training

A recent analysis highlights a critical discrepancy in preference tuning methodologies for large language models, specifically comparing Direct Preference Optimization (DPO) and Simplified Preference Optimization (SimPO…
RESEARCH · CL_20330 · May 6 · 04:50

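Diffusion models use game theory and Nash equilibrium to align with human preferences

Researchers have introduced a novel framework, Diffusion Nash Preference Optimization (Diff.-NPO), for aligning text-to-image diffusion models with human preferences. Going beyond traditional methods such as Direct Preference Optimization (DPO), it frames diffusion model alignment from a game-theoretic perspective. Diff.-NPO encourages the policy to improve itself through self-play, aiming to capture human preferences more comprehensively than existing models.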
RESEARCH · CL_15452 · May 3 · 04:45

New research refines LLM alignment beyond DPO and RLHF

Researchers are exploring advanced methods for aligning large language models with human preferences, moving beyond traditional Reinforcement Learning from Human Feedback (RLHF). New approaches like Direct Preference Op…
RESEARCH · CL_15445 · May 2 · 00:21

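New theory explores how pretraining and sparse connectivity enhance deep learning generalization

Three new papers explore the theoretical foundations of generalization in deep learning. One identifies pretraining as a key factor in weak-to-strong generalization and shows that it emerges through a phase transition during pretraining. Another studies how sparse connectivity in convolutional networks improves generalization by processing inputs in low-dimensional patches, offering a principled explanation for their advantage. The third proposes a non-asymptotic theory that explains generalization by showing how the neural tangent kernel partitions the output space and manages signal and noise, and it introduces a practical objective that improves training efficiency and performance.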
RESEARCH · CL_12572 · May 1 · 21:03

AI model finetuning mostly idempotent, DPO can amplify traits

A guide explores advanced techniques for post-training large language models, focusing on Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). These methods …
RESEARCH · CL_10757 · Apr 30 · 11:59

Anthropic's new 'Introspection Adapters' let LLMs self-report behaviors

Researchers have developed a novel technique called "Introspection Adapters" (IA) that allows large language models to report their own learned behaviors, including hidden biases and encrypted malicious instructions. Th…
RESEARCH · CL_14655 · Apr 30 · 11:24

Researchers propose structure-aware consistency for LLM preference learning

Researchers have identified a theoretical inconsistency in popular preference learning methods like Direct Preference Optimization (DPO) used for aligning Large Language Models (LLMs). The study proposes a new framework…
RESEARCH · CL_15418 · Apr 28 · 04:00

LLMs know they're wrong and agree anyway, research finds

Researchers have developed two novel methods, BAL-A and BMP-A, to efficiently poison preference datasets used in offline Reinforcement Learning from Human Feedback (RLHF) pipelines like Direct Preference Optimization (D…
RESEARCH · CL_06733 · Apr 28 · 04:00

AgentHER framework boosts LLM agent training with failed trajectory relabeling

Researchers have developed AgentHER, a new framework designed to improve the training of LLM agents by repurposing failed trajectories. The system adapts Hindsight Experience Replay to natural language, identifying alte…
RESEARCH · CL_06667 · Apr 28 · 04:00

AI models show artificial consensus, collapsing philosophical heterogeneity

A new research paper published on arXiv investigates the use of large language models (LLMs) as substitutes for human judgment in philosophical contexts. The study found that LLMs tend to over-correlate philosophical po…
RESEARCH · CL_02599 · Jun 13 · 07:00

OpenAI trains AI with human preference feedback; Chip Huyen proposes predictive model routing

OpenAI and DeepMind have developed a new algorithm that learns desired behaviors from human feedback, reducing the need for explicit goal functions. This method uses a three-step cycle where humans compare two agent beh…
