PulseAugur
METHODS · CLUSTERING

How AI news clustering works at PulseAugur

From 200+ sources to ~100 clusters per day: the deduplication, scoring, and gating that decide what reaches the Brief.

By Chris Valentine · Updated

Why clustering

Pick any AI announcement from the last week. The OpenAI release, the Anthropic research drop, the policy filing, the leaked roadmap. Now count the places that wrote about it: the vendor blog, the NYT piece, the Stratechery follow-up, the arXiv paper that dropped alongside, the Hacker News thread, the Reddit reaction in three subreddits, four Bluesky takes from lab people, two Mastodon threads, a TLDR AI bullet, a Smol AI summary, and the inevitable Substack hot take. That's sixteen sources for one event.

Read them all and you've absorbed the same story sixteen times. Skim the ones the algorithm pushed in front of you and you've seen the same headline three times and missed the actually interesting Bluesky thread from the engineer who built the thing. Either way, the deduplication failure is the actual problem — and nobody who's writing one of those sixteen pieces is incentivized to solve it, because every outlet wants you on their story.

PulseAugur's bet: the unit of AI news isn't the article, it's the event. Cluster every angle of one event into one page. Show the source list ranked by authority. Link to the original arXiv paper or vendor blog post first; the press paraphrases second. That single change means a reader can move through the industry's news at the pace of events, not at the pace of headlines.

Ingest pipeline

Every fifteen minutes a worker polls the source list — RSS, Atom, JSON feeds, syndication endpoints, and social-platform timelines for the platforms that expose them. Each candidate item is normalized into a common shape: id, title, title_en (machine-translated when the source language is non-English), body, url_original, lang, published_at, source_id, raw_json. The translation step runs only on titles — body content stays in the source language, and we cluster across languages on the translated title space.
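In sketch form, that normalized record looks roughly like the dataclass below. The field names come from the list above; the types and the dataclass itself are illustrative, not the production representation.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class NormalizedItem:
    """One candidate item after ingest normalization (fields as listed above)."""
    id: str
    title: str
    title_en: Optional[str]   # machine translation; only set for non-English sources
    body: str                 # left in the source language
    url_original: str
    lang: str
    published_at: str         # publication timestamp from the feed
    source_id: str
    raw_json: dict[str, Any]  # untouched payload from the poller, kept for audit
```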

Once an item is normalized, the duplicate detector decides whether the item belongs in an existing cluster or starts a new one. The detector uses a multi-signal stack documented in the next section. New clusters trigger an LLM enrichment pass that writes the cluster summary, the AI-impact blurb, and the newsworthiness bucket. Existing clusters get the new member appended and the cluster's score recomputed; if the cluster crosses the doubling threshold (item count doubles since the last summary), the LLM enrichment re-runs against the expanded corpus.
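The doubling rule is simple enough to sketch. The helper names here (recompute_score, enrich_cluster, members_at_last_summary) are placeholders for the scoring and enrichment stages described above, not PulseAugur's actual API:

```python
def add_member(cluster, item):
    """Append a new item to an existing cluster and decide whether to re-enrich."""
    cluster.members.append(item)
    cluster.score = recompute_score(cluster)   # placeholder for the scoring stage

    # Re-run LLM enrichment only once the cluster has doubled in size since
    # the last summary was written.
    if len(cluster.members) >= 2 * cluster.members_at_last_summary:
        enrich_cluster(cluster)                # rewrites summary, impact blurb, bucket
        cluster.members_at_last_summary = len(cluster.members)
```

Because re-enrichment only fires on doublings, a cluster that grows to n members gets re-summarized on the order of log n times rather than once per new item.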

The whole loop, end-to-end, runs in roughly two to four minutes from a fresh source-poll to the cluster appearing on the live Brief. Cluster scores recompute hourly against a sliding window of the corpus.
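One way the hourly recompute could be wired, assuming clusters are fetched by last-update time; the window width and the store interface are illustrative, not the production values:

```python
from datetime import datetime, timedelta, timezone

RESCORE_WINDOW = timedelta(days=7)   # assumed window width; the real value isn't stated

def hourly_rescore(store):
    """Recompute scores for every cluster still inside the sliding window."""
    cutoff = datetime.now(timezone.utc) - RESCORE_WINDOW
    for cluster in store.clusters_updated_since(cutoff):   # hypothetical store query
        cluster.score = recompute_score(cluster)           # placeholder scoring hook
        store.save(cluster)
```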

Deduplication

The duplicate detector runs four signal layers against the existing cluster set, in order:

  1. Canonical URL match. Lowercase host, strip UTM and tracking params, honor the page's <link rel="canonical"> tag, dereference URL shorteners. If two items resolve to the same canonical URL they're the same artifact and cluster trivially. This catches the syndication graph cleanly.
  2. Title shingling. N-gram overlap on the normalized title space. Catches "OpenAI Announces GPT-5" and "GPT-5: OpenAI's Latest Model" as the same story even when the URLs differ. The shingle threshold is tuned high enough to avoid clustering distinct stories about the same entity (two unrelated Anthropic posts on the same day stay separate).
  3. Cross-language matching. When non-English titles are translated to English at ingest, the translated title space goes through the same shingling pass. A Le Monde piece about the OpenAI announcement clusters with the NYT piece even though the source languages diverge. This is the layer that powers the cross-language coverage you see on cluster pages — French, German, Japanese, Chinese, and Spanish coverage all land in the same cluster as the English originals.
  4. Embedding similarity. The fallback. Title and lead-paragraph embeddings are compared against the recent cluster set; cosine similarity above a threshold flags a cluster-membership candidate that the title-shingle layer didn't catch (paraphrased headlines, opinion pieces about the same underlying event).

Each layer can either match (item joins the existing cluster) or pass (item moves to the next layer). If all four layers pass, the item becomes a new cluster. The signal stack has been validated against PulseAugur's own corpus growth from zero to 22,000+ clusters; false-positive rates (two distinct stories accidentally clustered) sit below 2% on spot checks. False-negative rates (one story split across multiple clusters) are higher, around 5% — the conservative side of the trade. We'd rather show two clusters that turn out to be the same event than one cluster that conflates two events.
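A condensed sketch of that cascade is below. The thresholds, helper functions, and cluster attributes are illustrative stand-ins; the shingle and embedding cutoffs in particular are invented numbers, not the tuned production values.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Params stripped during canonicalization; an assumed list, not the production one.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "fbclid", "gclid"}

SHINGLE_THRESHOLD = 0.5    # illustrative cutoff
EMBED_THRESHOLD = 0.85     # illustrative cutoff


def canonicalize(url: str) -> str:
    """Lowercase the host and drop tracking params. (Honoring rel=canonical and
    dereferencing shorteners happen at fetch time, before this step.)"""
    p = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(p.query) if k not in TRACKING_PARAMS]
    return urlunparse((p.scheme, p.netloc.lower(), p.path.rstrip("/"),
                       p.params, urlencode(query), ""))


def shingles(title: str, n: int = 3) -> set:
    """Word n-grams over a lowercased, punctuation-stripped title."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 0))}


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0


def assign(item, clusters, embedder):
    """Run the four layers in order against the existing cluster set.
    Returns the matching cluster, or None to signal 'start a new cluster'."""
    canon = canonicalize(item.url_original)
    title = item.title_en or item.title        # layers 2 and 3 share the English title space
    title_sh = shingles(title)

    # 1. Canonical URL match: same artifact, trivially the same cluster.
    for c in clusters:
        if canon in c.canonical_urls:
            return c
    # 2 + 3. Title shingling, on native-English and machine-translated titles alike.
    for c in clusters:
        if jaccard(title_sh, c.title_shingles) >= SHINGLE_THRESHOLD:
            return c
    # 4. Embedding similarity fallback on title plus lead paragraph.
    for c in clusters:
        if embedder.cosine(item, c) >= EMBED_THRESHOLD:
            return c
    return None
```

Running the cheap, high-precision layers first means most syndicated duplicates never reach the embedding pass.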

Scoring

Each cluster carries a display_score from 0 to 100. Behind the displayed number, six dimensions feed in. Each dimension is computed in its own pipeline stage, normalized to a 0–1 range, and then sigmoid-combined into a raw_score that is scaled to the final display_score. The dimensions:

  • dim_authority — source credibility. A weighted sum of the source authorities for every member in the cluster. A cluster with three TIER_1 sources outweighs a cluster with twelve TIER_3 sources. This is the dimension that prevents a viral Reddit thread from outranking a vendor announcement covered by three major press outlets.
  • dim_cluster_strength — corroboration. How many independent sources agree on the event. Independence here means distinct publishers, not distinct URLs — three Hearst properties republishing the same wire story count as one source for this dimension. The signal is "do we have multiple unrelated parties confirming this."
  • dim_originality — primary vs. echo. Whether the cluster's lead source is primary reporting or a paraphrase of an earlier source. We boost vendor blogs, arXiv papers, and original investigative pieces; we demote clusters that are entirely composed of "X reports that Y said" coverage.
  • dim_headline — LLM-judged headline strength. The cluster summary call also asks the model to rate the story's headline strength on a constrained scale. Surprisingly useful — captures the "is this actually interesting or is this routine" signal that pure aggregation metrics miss.
  • dim_buzz — social velocity, normalized. Hacker News upvote velocity, Reddit comment density, Bluesky + Mastodon repost rates. Normalized so a story breaking on a Sunday at 2 AM doesn't get penalized for hitting an empty timeline.
  • dim_velocity — time-decayed engagement. How fast the cluster grew from creation to its current member count. A cluster that picked up ten sources in the first hour ranks above one that picked up the same ten sources over a week.

The sigmoid combination keeps any single dimension from dominating; the gating layer (Lever B for outlier rejection, Lever C for bucket quality, Lever D for hedge detection) then applies tier-specific demotions before the score gets persisted. The full implementation lives in rank/score.py — every parameter has a doc-string explaining what it controls and what data the threshold was tuned against.
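As a rough illustration of the shape of that combination (the real weights and functional form live in rank/score.py and are not reproduced here; every number below is invented):

```python
import math

# Illustrative weights only; the production values live in rank/score.py.
WEIGHTS = {
    "dim_authority":        1.4,
    "dim_cluster_strength": 1.2,
    "dim_originality":      1.0,
    "dim_headline":         0.8,
    "dim_buzz":             0.7,
    "dim_velocity":         0.9,
}

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def display_score(dims: dict) -> int:
    """Combine six 0-1 dimension values into a 0-100 display_score."""
    raw = sum(WEIGHTS[k] * (dims[k] - 0.5) for k in WEIGHTS)  # center each dim at 0
    return round(100 * sigmoid(raw))
```

Centering each dimension before the weighted sum puts a perfectly average cluster near 50, and the sigmoid keeps an extreme value on any single dimension from pushing the final number to either end on its own.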

Newsworthiness buckets

Beyond the numeric score, every cluster carries a newsworthiness bucket — an LLM-judged label that goes through quality gates before it becomes the persisted effective bucket:

  • frontier_release. Major model releases or capability announcements from a frontier-tier lab. Gated on ≥3 corroborating sources and TIER_1 authority average; otherwise demoted to significant.
  • significant. Industry-shaping news that's not a model launch — funding rounds at scale, executive moves, regulatory action, major enterprise rollouts.
  • research. Papers, alignment results, eval suites, methodological breakthroughs. Gated on ai_relevance ≥ 0.6 and ≥3 sources; otherwise demoted to tool.
  • tool. Product launches, feature drops, developer tooling, useful-but-not-shaping coverage.
  • commentary. Opinion, analysis, retrospective pieces. Demoted from significant when the LLM flags hedge-detection signals (the writer is speculating rather than reporting).
  • meme. Off-topic, viral-but-not-substantive, or refused (the LLM declined to summarize). Filtered out of the default Brief view.

Two bucket fields exist on each cluster for audit transparency. newsworthiness_bucket stores the LLM's raw label; effective_bucket stores the post-gate label actually used for ranking and downstream consumers. The bucket_reason field captures the demotion trace when the two diverge — for instance, "[lever_c_demoted from research: ic=2 ai=0.45]" tells you the cluster was originally bucketed as research but failed the corroboration gate. Read the full editorial transparency posture at /editorial-standards.
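A sketch of what such a gate looks like, using only the thresholds stated in the bucket list above (≥3 corroborating sources, TIER_1 authority average, ai_relevance ≥ 0.6). The lever C reason string follows the example format above; the other reason strings and the cluster attributes are illustrative stand-ins:

```python
def apply_bucket_gates(cluster):
    """Return (effective_bucket, bucket_reason); reason is None when no gate fires."""
    raw = cluster.newsworthiness_bucket
    ic = cluster.independent_source_count    # distinct publishers, not distinct URLs
    ai = cluster.ai_relevance

    if raw == "frontier_release" and (ic < 3 or cluster.avg_authority_tier != "TIER_1"):
        return "significant", f"[gate_demoted from frontier_release: ic={ic}]"
    if raw == "research" and (ic < 3 or ai < 0.6):
        return "tool", f"[lever_c_demoted from research: ic={ic} ai={ai:.2f}]"
    if raw == "significant" and cluster.hedge_flagged:
        return "commentary", "[lever_d_demoted from significant: hedge]"
    return raw, None
```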

How we attribute

Every cluster page lists every source article that fed the cluster — publisher name, byline, original-language title plus machine-translated English title where applicable, publication timestamp, and a TIER chip surfacing the source's authority weight. The link on each member row goes to the original source URL, not a PulseAugur intermediary.

The same citation graph ships in machine-readable form via schema.org/NewsArticle.isBasedOn in the cluster page's JSON-LD. Crawlers and AI engines that parse the structured data see the citation list directly, which is what lets PulseAugur cluster pages get cited as roundup-style sources by Perplexity, ChatGPT, Gemini, and Claude when those engines answer queries about specific AI events.
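Roughly what that looks like on a cluster page, sketched here as the Python dict that would be serialized into the JSON-LD block; the field selection beyond isBasedOn is illustrative:

```python
import json

def cluster_jsonld(cluster) -> str:
    """Serialize the cluster's citation graph as schema.org/NewsArticle JSON-LD."""
    payload = {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": cluster.summary_title,
        "dateModified": cluster.updated_at,   # bumped whenever a correction lands
        "isBasedOn": [m.url_original for m in cluster.members],  # original source URLs
    }
    return json.dumps(payload, ensure_ascii=False, indent=2)
```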

The cluster summary itself is machine-written. We mark it explicitly with an AI chip and a per-summary disclosure naming the LLM that wrote it (Gemini 2.5 Pro, Claude Opus 4.7, GPT-5, depending on the cluster). Corrections policy: 24-hour acknowledgement, 48-hour fix, with a dateModified stamp bump so syndicated indexes pick up changes. Email [email protected] for any cluster you think we got wrong.

Why this beats single-editor digests

The single-editor digests — The Batch, Import AI, Last Week in AI, Smol AI, even Stratechery — are excellent at what they do. One editor picks a small set of stories per week, writes original analysis, and ships a coherent reading experience. PulseAugur is not trying to replace them.

The shape PulseAugur covers is different. Single-editor digests are weekly — by design, because writing real analysis takes time. They surface roughly five to ten stories per issue. Their selection bias is the editor's own taste, which is a feature, not a bug.

PulseAugur is continuous (clusters appear within minutes of source emission), broad (~100 clusters per day surface in the default Brief), and cross-language (a French Le Monde piece, a German Heise piece, and a Japanese Nikkei piece about the same event all cluster together). Different audience, different reading cadence, different decisions. If you want one curated essay a week, read The Batch; if you want to see what's actually moving through the industry right now, read PulseAugur. Best practice is both.

The structural advantage clustering has over single-editor curation is that scale doesn't degrade signal. An editor who picks ten stories a week can write thoughtfully about each one but can't cover the long tail. A clustering pipeline that processes 200+ sources continuously sees the long tail by construction; the scoring layer handles "is this actually interesting" without humans having to triage every candidate. Both shapes are valuable; PulseAugur fits the slot the single-editor format structurally can't fill.