PulseAugur
research · 5 sources

Perplexity details research on SFT+RL pipeline for accurate, efficient AI answers

Perplexity has detailed its proprietary post-training pipeline for adapting base models to search-augmented question answering. The process starts with supervised fine-tuning for instruction following and safety, followed by on-policy reinforcement learning to improve search accuracy and tool efficiency. The reward design combines correctness, user preference, and efficiency, crediting preference only when the answer is correct, so the model cannot optimize for plausible but wrong responses. Perplexity claims that this method, applied to Alibaba's Qwen models, matches or beats GPT models on factuality at a lower cost.
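A minimal sketch of how such a correctness-gated reward could look, assuming a simple weighted combination; the function name, the weights w_pref and w_eff, and the per-tool-call penalty are hypothetical illustrations, not Perplexity's published code:

# Hypothetical correctness-gated reward: preference is credited only
# when the answer is correct, so the policy cannot trade correctness
# for better-sounding wrong answers; efficiency enters as a small
# penalty per search/tool call. All weights are illustrative.
def search_qa_reward(is_correct: bool, preference: float,
                     num_tool_calls: int,
                     w_pref: float = 0.5, w_eff: float = 0.05) -> float:
    correctness = 1.0 if is_correct else 0.0
    gated_preference = w_pref * preference if is_correct else 0.0
    return correctness + gated_preference - w_eff * num_tool_calls

# A correct, well-liked answer using 3 searches: 1.0 + 0.4 - 0.15 = 1.25
print(search_qa_reward(True, 0.8, 3))
# A fluent but wrong answer earns no preference credit: 0.0 - 0.05 = -0.05
print(search_qa_reward(False, 0.9, 1))

Under this gating, a fluent wrong answer can never outscore a correct one, which matches the stated goal of not rewarding better-sounding wrong answers.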

Summary written by gemini-2.5-flash-lite from 5 sources.

IMPACT Perplexity's research details a pipeline that improves model accuracy and efficiency for search-augmented answers, potentially lowering operational costs.

RANK_REASON Perplexity published new research detailing their model post-training pipeline.

Read on X — Perplexity

COVERAGE [5]

  1. X — Perplexity TIER_1 · perplexity_ai ·

    This pipeline is why the same base model produces more accurate, better-cited, and more efficient answers inside Perplexity than out of the box. Read our research: https://t.co/pYjUTnkPMW

  2. X — Perplexity TIER_1 · perplexity_ai ·

    Our reward design combines correctness, preference, and efficiency. Preference only counts when the answer is correct. This keeps the model from optimizing for better-sounding wrong answers. https://t.co/VbJ1M4o26w

  3. X — Perplexity TIER_1 · perplexity_ai ·

    @Alibaba_Qwen We first fine-tune the model to follow instructions, stay within guardrails, and keep language consistent. Then we run on‑policy RL to improve search accuracy and tool efficiency while preserving those behaviors. https://t.co/KaVs7h5Ixa

  4. X — Perplexity TIER_1 · perplexity_ai ·

    We've published new research on how we post-train models for accurate search-augmented answers. Our SFT + RL pipeline improves search, citation quality, instruction following, and efficiency. With Qwen models, we match or beat GPT models on factuality at a lower cost. https://t…

  5. X — Aravind Srinivas (Perplexity) TIER_1 · Aravind Srinivas ·

RT Perplexity: We've published new research on how we post-train models for accurate search-augmented answers. Our SFT + RL pipeline improves search, citation quality, instruction following, and efficiency. With Qwen models, we match or beat GPT models on factuality at a lower cost.
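Taken together, the coverage above describes a two-stage recipe: supervised fine-tuning for instruction following and guardrails, then on-policy RL for search accuracy and tool efficiency. The skeleton below is a toy, runnable outline of that structure; the lookup-table policy, the example tasks, and the keep-the-best "update" are hypothetical stand-ins, not Perplexity's pipeline. A real system would update model weights with a policy-gradient method (e.g., PPO or GRPO), typically with a KL penalty toward the SFT model so that stage-1 behaviors are preserved.

import random

Policy = dict[str, str]  # prompt -> answer, standing in for model weights

def supervised_finetune(policy: Policy, demos: list[tuple[str, str]]) -> Policy:
    # Stage 1 (SFT): imitate curated demonstrations so the model follows
    # instructions, stays within guardrails, and keeps language consistent.
    policy.update(dict(demos))
    return policy

def on_policy_rl(policy: Policy, tasks: list[tuple[str, str]],
                 steps: int = 50) -> Policy:
    # Stage 2 (on-policy RL): sample from the *current* policy, score each
    # rollout with a simplified correctness-plus-efficiency reward
    # (preference omitted here), and reinforce the best-scoring behavior.
    best = {prompt: float("-inf") for prompt, _ in tasks}
    for _ in range(steps):
        for prompt, gold in tasks:
            # Toy rollout: with small probability the sample happens to be
            # the reference answer, standing in for real exploration; each
            # rollout spends a random number of search calls.
            answer = gold if random.random() < 0.2 else policy.get(prompt, "")
            searches = random.randint(1, 4)
            r = (1.0 if answer == gold else 0.0) - 0.05 * searches
            if r > best[prompt]:
                best[prompt], policy[prompt] = r, answer
    return policy

# Usage: SFT on curated demonstrations, then RL on tasks with known answers.
policy = supervised_finetune({}, [("Who wrote SICP?", "Abelson and Sussman")])
policy = on_policy_rl(policy, [("Capital of France?", "Paris")])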