How we built automatic clustering for LLM traces

We're all used to traditional clickstream product analytics data. It's often one of the most important datasets for anyone trying to build something. We're also familiar with backend observability data — metrics, logs, traces — more operational, but still essential.

In the age of AI, though, we now have a different kind of data that mashes aspects of both together. I'm talking about the LLM traces and generations your AI agent or agentic workflows produce as they get stuff done.

This data is crucial for seeing whether your agent is actually doing its job, and it's also rich with insights about how your users are using it, what they're trying to do, and even how they feel about the whole interaction.

Think about all your interactions with an AI in the last week and all the subtle (or not so subtle) signals embedded in them. There's a lot of useful information there for the people building the thing you're using — and it all kind of comes for free by the nature of the UX.

An angry chatbot exchange

One of my recent interactions trying to chat with my airline about tickets.

So with this in mind, we built a new feature recently, aptly named "Clustering." We probably should have called it "Agentic AI-driven magic AI unicorn insights agentic" but I'm bad at names.

Here's how the whole pipeline works under the hood, with links to the code if you want to dig deeper. If you just want to see it in action, skip to the demo.

The pipeline

Here's the overall flow from raw traces to clustered insights:

  1. Ingest — Traces and generations land as PostHog events
  2. Text representation — Convert each trace to a readable text format
  3. Sample — Hourly sampling of N traces/generations
  4. Summarize — LLM-powered structured summarization
  5. Embed — Generate embedding vectors from summaries
  6. Cluster — UMAP dimensionality reduction + HDBSCAN clustering
  7. Label — An AI agent names and describes each cluster
  8. Display — Clusters tab with scatter plot and distribution chart

Traces → Text repr → Sample → Summarize → Embed → Cluster → Label → Display

Design considerations

Before diving into the steps, here are the main considerations we kept coming back to and the choices we made for each:

  • Huge traces — Some traces are enormous and can't just be thrown at an LLM. Our answer: the uniform downsampling described in step 1. We iteratively drop lines while preserving the overall structure, so even massive traces fit within context limits.
  • Keeping costs sane — Running LLMs on every single trace would bankrupt us. So we sample a small random subset each hour, use GPT-4.1 nano for summarization (fast and cheap), and are planning to move to the OpenAI Batch API to optimize further.
  • One-size-fits-all vs. custom summaries — It's hard to have a single general summary schema that works perfectly for every type of trace we might see. Ideally, users could steer or bring their own prompts and structured outputs. For now, we started with a general-purpose prompt and summary schema that works well across most use cases — but enabling user-defined summarization is a potential future improvement.
  • Zero-config — The Temporal workflows just run in the background. As you send traces to LLM Analytics, clustering will just work once there's enough data to sample from. No setup, no configuration needed.
  • User steering — In the future, we'd love for users to define their own clustering configs, filter to specific subpopulations, and even kick off clustering runs on demand. For example, if you know you want to look at just your "refund request" traces, you could define a job for that. We're not there yet, but the architecture supports it.

Step 1: From JSON blobs to readable text

Traces are ingested into LLM Analytics as normal PostHog events. They have their own loose schema around special $ai_* properties — covering everything from generations and spans to sessions — but also the flexibility of general PostHog events, which means they generally work with any other PostHog feature out of the box.

At the end of the day, though, this gives us a blob of JSON with some expected $ai_* properties, and even within each property the structure and format can vary wildly depending on the LLM provider, the framework, or how users have instrumented their own agents.
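
For flavor, here's roughly the shape of a single generation event after ingestion. The values below are made up, and real events usually carry many more $ai_* properties (token counts, latency, provider metadata, and so on):

Python
# Illustrative shape of one $ai_generation event; values are made up and
# real events carry more $ai_* properties than shown here.
generation_event = {
    "event": "$ai_generation",
    "properties": {
        "$ai_trace_id": "trace_abc123",
        "$ai_model": "gpt-4.1-nano",
        "$ai_provider": "openai",
        "$ai_input": [{"role": "user", "content": "explain this more to me"}],
        "$ai_output_choices": [{"role": "assistant", "content": "It seems counterintuitive, right?..."}],
        "$ai_input_tokens": 412,
        "$ai_output_tokens": 96,
    },
}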

So we need to figure out how to get from this bag of JSON to something a clustering algorithm can work with — we need numbers. And whenever you need numbers with LLMs, it often means you need embeddings. Our task is to figure out what embeddings to generate that will be useful downstream.

We can't just throw some massive JSON blob at an LLM and expect it to produce a great summary. It might be okay, but we can do better by putting ourselves in the shoes of our LLM: if I can create a general text representation where I myself can easily "read" a trace, then that'll also be a great input for summarization.

So we built a process that renders each trace as a clean, simple text representation — essentially an ASCII tree with line numbering. Here's what it looks like in the PostHog UI:

Text representation view in PostHog

You can see the full text representation for this trace in this gist — it's a real example from one of my side projects (a daily factoids chatbot that got into a surprisingly deep conversation about mantis shrimp vision and digital signal processing).

The line numbering (L001:, L002:) is important — it gives the downstream summarization LLM a way to reference specific parts of the trace, which makes the structured output much more useful. You can see how this is generated in trace_formatter.py.
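
As a rough illustration of the numbering itself (the real logic lives in trace_formatter.py and renders the full tree), it's just a zero-padded prefix applied to each rendered line:

Python
# Toy sketch of the line-numbering idea; trace_formatter.py does the real work.
def number_lines(rendered_lines: list[str]) -> str:
    """Prefix each rendered trace line with a zero-padded line number."""
    return "\n".join(
        f"L{i:03d}: {line}" for i, line in enumerate(rendered_lines, start=1)
    )

print(number_lines(["[1] SYSTEM", "You are a daily factoids assistant."]))
# L001: [1] SYSTEM
# L002: You are a daily factoids assistant.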

Handling huge traces

Some traces are just too big. We use uniform downsampling to shrink them — picking every Nth line from the body while preserving the header. The sampler notes what percentage of lines were kept, and the gaps in line numbers tell the LLM that content was omitted. Using the mantis shrimp trace from the gist, here's a snippet of what a downsampled version might look like:

L001: ----------------------------------------------------------
L002:
[SAMPLED VIEW: ~40% of 136 lines shown]
L003: AVAILABLE TOOLS: 1
L005: web_search(max_results?: integer, query?: any)
L012: [1] SYSTEM
L018: Factoid text: The mantis shrimp can perceive 12-16 types of color receptors compared to humans' 3...
L035: [2] USER
L037: huh - explain this more to me
L041: [3] ASSISTANT
L045: - web_search(query="mantis shrimp vision 12 vs 16...")
L049: [4] TOOL
L051: {"query": "mantis shrimp vision...", "results": [...]}
L059: [5] ASSISTANT
L061: It seems counterintuitive, right?...
L077: [6] USER
L079: any relation at all to something like DSP or FFT?
L101: [9] ASSISTANT
L103: That is a brilliant connection!...
L122: [10] USER
L124: oh interesting - k bye i love you
L130: [11] ASSISTANT
L132: Haha, well that escalated quickly! I love you too...
L136: **K bye!**

Notice the [SAMPLED VIEW: ~40% of 136 lines shown] header and the jumps in line numbers (e.g., L051 → L059, L079 → L101). The LLM can still follow the conversation flow — the structure and key turns are preserved, just with less detail in between. This works well in practice because the same context often gets passed back and forth within each step of a trace. See the full implementation in message_formatter.py.
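
To make the sampling concrete, here's a minimal sketch of the idea, assuming the lines are already numbered; the real message_formatter.py handles headers and banner placement more carefully:

Python
# Minimal sketch of uniform downsampling: keep the header, then every Nth
# body line. Gaps in the surviving line numbers tell the LLM content was cut.
def downsample(numbered_lines: list[str], header_lines: int, max_lines: int) -> list[str]:
    header = numbered_lines[:header_lines]
    body = numbered_lines[header_lines:]
    budget = max(max_lines - header_lines, 1)
    if len(body) <= budget:
        return numbered_lines
    step = -(-len(body) // budget)  # ceiling division
    kept = body[::step]
    pct = round(100 * (len(header) + len(kept)) / len(numbered_lines))
    banner = f"[SAMPLED VIEW: ~{pct}% of {len(numbered_lines)} lines shown]"
    return header + [banner] + kept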

Step 2: Summarization

Now that we have a readable text representation, we can summarize it. We sample N traces per hour (cost management — we're not summarizing everything) and send each text representation to an LLM for structured summarization.

The key word there is structured. Rather than asking for a free-text summary, we ask for a specific schema:

Python
# Simplified summarization schema
# See: schema.py
from pydantic import BaseModel


class SummaryBullet(BaseModel):
    bullet: str  # The summary point
    line_refs: list[str]  # e.g., ["L003", "L015"] — back-references


class SummarizationResponse(BaseModel):
    title: str  # Short descriptive title
    flow_diagram: str  # ASCII flow of the trace steps
    summary_bullets: list[SummaryBullet]  # Key points with line refs
    interesting_notes: list[str]  # Anything unusual or notable
See the actual implementation: schema.py

We use GPT-4.1 nano for this — it's fast, cheap, and the structured output mode means we get reliable, parseable results every time. The line references back to the text representation are particularly useful: they let the summary stay grounded in the actual trace data rather than hallucinating. You can read the actual prompts we use: system_detailed.djt and user.djt. (If you have sensitive data in your traces, check out privacy mode which controls what gets sent.)
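
To make that concrete, here's a minimal sketch of the summarization call, assuming the SummarizationResponse model from the snippet above and the OpenAI Python SDK's structured output support; the real prompts are the .djt templates linked above:

Python
# Sketch of structured summarization; not the production code.
from openai import OpenAI

client = OpenAI()

def summarize_trace(trace_text: str) -> SummarizationResponse:
    completion = client.beta.chat.completions.parse(
        model="gpt-4.1-nano",
        messages=[
            {"role": "system", "content": "Summarize this LLM trace..."},  # simplified prompt
            {"role": "user", "content": trace_text},
        ],
        response_format=SummarizationResponse,
    )
    return completion.choices[0].message.parsed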

These summaries appear as $ai_trace_summary and $ai_generation_summary events in your project every hour. They also power the on-demand trace summarization feature you can use to quickly understand any individual trace or generation without reading through the full conversation.

Why structured output matters

Structured output is better than a raw summary for one critical reason: downstream embedding quality. When we embed a title + flow diagram + specific bullets with line references, we get a much higher signal representation than embedding a wall of free text. The LLM has already done the work of extracting what matters.

Why not RAG? We considered building a full RAG system — chunking traces, building indices, retrieving at query time. But the complexity of doing that well at PostHog's scale (billions of events across thousands of teams) made us reach for a simpler approach first. Summarization-first means each trace becomes a small, self-contained artifact that's easy to embed and cluster. We may still build RAG for features like natural language search, but for clustering, this works well.

Step 3: Embedding

Once we have summaries, we format them back into plain text — title, flow diagram, bullets, and notes concatenated together, with line numbers stripped to reduce noise — and embed them using OpenAI's text-embedding-3-large model, giving us a 3,072-dimensional vector for each summary.

Why embed the enriched summary instead of the raw trace? Because the summary lives in a higher-level semantic space. Raw traces contain a lot of noise — token counts, model versions, repeated system prompts. The summary captures the intent and flow, which is exactly what we want our clusters to be organized around.
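
As a sketch of what that looks like in code, assuming a small helper that flattens the structured summary into plain text before calling the embeddings API:

Python
# Sketch of embedding the enriched summaries; not the production code.
from openai import OpenAI

client = OpenAI()

def summary_to_text(summary: SummarizationResponse) -> str:
    """Flatten a structured summary into plain text, dropping the line references."""
    bullets = "\n".join(f"- {b.bullet}" for b in summary.summary_bullets)
    notes = "\n".join(f"- {note}" for note in summary.interesting_notes)
    return f"{summary.title}\n\n{summary.flow_diagram}\n\n{bullets}\n\n{notes}"

def embed_summaries(summaries: list[SummarizationResponse]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=[summary_to_text(s) for s in summaries],
    )
    return [item.embedding for item in response.data]  # 3,072 floats each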

So now, for each trace or generation, we have 3,072 numbers. Time for the fun part.

Step 4: Clustering

This is where we lean on more "old school" traditional ML techniques. There are still several areas where LLMs haven't eaten the world: recommender engines, clustering, time series modeling, tabular model building, and getting well-calibrated numbers for regression problems. For all of these, you're still better off equipping your agent with the tools to formulate and run traditional algorithms on the data, rather than asking it to do the calculations directly.

Dimensionality reduction

We can't just cluster the raw 3,072-dimensional vectors — the curse of dimensionality would make distances meaningless. So we first need to reduce dimensions while preserving the important structure.
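
A quick toy illustration of why (not part of the pipeline): as dimensionality grows, the nearest and farthest points from a random query end up nearly the same distance away, so density-based clustering has little to work with.

Python
# Toy illustration of distance concentration in high dimensions.
import numpy as np

rng = np.random.default_rng(0)

def max_min_distance_ratio(dim: int, n_points: int = 1000) -> float:
    """Ratio of farthest to nearest neighbour distance for a random query point."""
    points = rng.normal(size=(n_points, dim))
    query = rng.normal(size=dim)
    distances = np.linalg.norm(points - query, axis=1)
    return float(distances.max() / distances.min())

for dim in (2, 100, 3072):
    print(dim, round(max_min_distance_ratio(dim), 2))
# The ratio shrinks toward 1 as dim grows, so "nearest" carries less signal.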

We actually run UMAP twice, with different goals:

        3072-D embeddings
        ┌───────┴───────┐
        ↓               ↓
      100-D            2-D
     (UMAP)          (UMAP)
        ↓               ↓
     HDBSCAN      Scatter plot

  • 3072 → 100 dimensions for clustering: min_dist=0.0 packs similar points tightly together, which is what HDBSCAN wants.
  • 3072 → 2 dimensions for visualization: min_dist=0.1 keeps some visual separation so the scatter plot is readable.

HDBSCAN

For the actual clustering, we use HDBSCAN on the 100-D reduced embeddings:

Python
# Simplified clustering pipeline
# See: clustering.py
import umap
import hdbscan
import numpy as np


def cluster_embeddings(embeddings: np.ndarray):
    """Run the full clustering pipeline."""
    # Normalize embeddings
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms

    # UMAP for clustering (100-D)
    reducer_cluster = umap.UMAP(
        n_components=100,
        min_dist=0.0,
        metric="cosine",
    )
    embeddings_100d = reducer_cluster.fit_transform(normalized)

    # UMAP for visualization (2-D)
    reducer_viz = umap.UMAP(
        n_components=2,
        min_dist=0.1,
        metric="cosine",
    )
    embeddings_2d = reducer_viz.fit_transform(normalized)

    # HDBSCAN clustering on 100-D embeddings
    clusterer = hdbscan.HDBSCAN(
        cluster_selection_method="eom",
        min_cluster_size=5,
    )
    labels = clusterer.fit_predict(embeddings_100d)

    return labels, embeddings_2d

See the actual implementation: clustering.py

HDBSCAN has some nice properties for our use case:

  • No need to pick k (number of clusters) in advance — it figures this out automatically based on density.
  • Noise cluster — items that don't fit any cluster get assigned to cluster -1. These "outliers" can be interesting edge cases you might want to explore.
  • cluster_selection_method="eom" (Excess of Mass) tends to produce more granular, interpretable clusters compared to the "leaf" method.

After this step, every trace or generation has an integer cluster ID. But an integer doesn't tell you much — we need to make these meaningful.
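
For example, turning the labels array from the snippet above into per-cluster counts is one numpy call away:

Python
# Turn raw HDBSCAN labels into per-cluster counts; -1 is the noise cluster.
import numpy as np

cluster_ids, counts = np.unique(labels, return_counts=True)
for cluster_id, count in zip(cluster_ids, counts):
    name = "noise" if cluster_id == -1 else f"cluster {cluster_id}"
    print(f"{name}: {count} traces")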

Step 5: The labeling agent

Now we have clusters with integer IDs. Cluster 0 has 47 traces, cluster 1 has 23, cluster -1 (noise) has 8. Not exactly actionable.

This is where we bring AI back in — specifically, a LangGraph ReAct agent powered by GPT-5.2. Its job is to explore the clusters and come up with meaningful labels and descriptions. You can read the agent's system prompt in prompts.py.

The agent has access to 8 tools:

Tool                                  Purpose
get_clusters_overview                 High-level stats: cluster sizes, counts
get_all_clusters_with_sample_titles   Quick scan of what's in each cluster
get_cluster_trace_titles              All trace titles for a specific cluster
get_trace_details                     Full summary for a specific trace
get_current_labels                    See labels assigned so far
set_cluster_label                     Set name + description for one cluster
bulk_set_labels                       Set labels for multiple clusters at once
finalize_labels                       Signal that labeling is complete

The agent follows a two-phase strategy:

Python
# Simplified labeling agent flow
# See: labeling_agent/
# Phase 1: Bulk labels (guaranteed coverage)
# Agent calls get_all_clusters_with_sample_titles to get an overview,
# then bulk_set_labels with initial labels for ALL clusters.
# This ensures every cluster gets a label even if the agent
# runs out of budget.
# Phase 2: Refinement
# Agent iterates through clusters that seem ambiguous,
# calling get_cluster_trace_titles and get_trace_details
# to drill deeper, then set_cluster_label to refine.
# Finally: finalize_labels to signal completion.

See the actual implementation: labeling_agent/

The two-phase approach is deliberate. If the agent only refined one cluster at a time and hit a token limit or error partway through, you'd end up with half your clusters unlabeled. Bulk-first ensures coverage, then refinement improves quality where it can.
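
To give a feel for the shape of this, here's a hedged sketch of one labeling tool wired into a LangGraph ReAct agent. The tool body, state handling, and model id are illustrative placeholders; the real implementation lives in labeling_agent/ and prompts.py:

Python
# Hedged sketch of one labeling tool on a LangGraph ReAct agent.
# Names, state handling, and the model id are illustrative placeholders.
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

cluster_labels: dict[int, dict[str, str]] = {}

@tool
def set_cluster_label(cluster_id: int, name: str, description: str) -> str:
    """Set the name and description for one cluster."""
    cluster_labels[cluster_id] = {"name": name, "description": description}
    return f"Labeled cluster {cluster_id} as '{name}'"

agent = create_react_agent(
    model="openai:gpt-4.1",  # placeholder; the real model is configured elsewhere
    tools=[set_cluster_label],  # the production agent gets all 8 tools
    prompt="You label clusters of LLM trace summaries...",  # simplified prompt
)

result = agent.invoke(
    {"messages": [{"role": "user", "content": "Label every cluster."}]}
)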

Step 6: Orchestration with Temporal

All of this needs to run reliably across thousands of teams. We use Temporal workflows to orchestrate the pipeline.

A daily coordinator workflow discovers eligible teams (those with enough recent trace data), then spawns child workflows in batches — up to 4 concurrent workflows at a time to manage load:

Python
# Simplified coordinator
# See: coordinator.py
# Daily coordinator workflow:
# 1. Query for teams with recent LLM trace data
# 2. For each eligible team, spawn a child workflow:
# a. Fetch recent embeddings
# b. Run clustering pipeline (UMAP + HDBSCAN)
# c. Run labeling agent
# d. Emit cluster events
# 3. Max 4 concurrent child workflows

See the actual implementation: coordinator.py
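
As a rough sketch of what the Temporal side can look like (workflow and activity names here are hypothetical; see coordinator.py for the real thing):

Python
# Hedged sketch of the coordinator; workflow and activity names are hypothetical.
import asyncio
from datetime import timedelta

from temporalio import workflow

MAX_CONCURRENT = 4

@workflow.defn
class ClusteringCoordinatorWorkflow:
    @workflow.run
    async def run(self) -> None:
        # 1. Find teams with enough recent LLM trace data (hypothetical activity name)
        team_ids = await workflow.execute_activity(
            "get_eligible_teams",
            start_to_close_timeout=timedelta(minutes=5),
        )
        # 2. Spawn per-team child workflows in batches of MAX_CONCURRENT
        for start in range(0, len(team_ids), MAX_CONCURRENT):
            batch = team_ids[start : start + MAX_CONCURRENT]
            await asyncio.gather(
                *(
                    workflow.execute_child_workflow(
                        "TeamClusteringWorkflow",  # embed -> cluster -> label -> emit events
                        team_id,
                        id=f"llm-clustering-team-{team_id}",
                    )
                    for team_id in batch
                )
            )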

The output of each child workflow is a set of $ai_trace_clusters and $ai_generation_clusters events. These are standard PostHog events, which means the clusters tab in the UI is just querying events like everything else in PostHog.

A concrete example

Here's what you're seeing in the demo:

  • Scatter plot — Each dot is a trace, positioned using the 2-D UMAP coordinates. Colors represent cluster assignments. You can immediately see which groups of traces are similar and how they relate spatially.
  • Cluster distribution — A bar chart showing how many traces landed in each cluster, with the agent-generated labels. This gives you a quick sense of what your users are actually doing.
  • Drill-down — Click into any cluster to see the individual traces, their summaries, and the full details. This is where you find the patterns — maybe 40% of your traces are "user asking for refund status" and you didn't even know.

The noise cluster (labeled "Outliers") often contains the most interesting one-off traces — edge cases, unusual workflows, or bugs that don't fit any pattern. Pair this with evaluations (our LLM-as-a-judge feature) to automatically score the quality of generations within each cluster.

Try it now

If you're already using LLM Analytics in PostHog, clustering runs automatically — no configuration needed. Once you have enough trace data flowing in, clusters will appear in the clusters tab. If you're not using LLM Analytics yet, you can get started in minutes with SDKs for OpenAI, Anthropic, LangChain, Vercel AI, and many more.

Try clusters in PostHog

PostHog is an all-in-one developer platform for building successful products. We provide product analytics, web analytics, session replay, error tracking, feature flags, experiments, surveys, LLM analytics, data warehouse, CDP, and an AI product assistant to help debug your code, ship features faster, and keep all your usage and customer data in one stack.
