Glassbox: Grab vLLM's Attention
Transformer internals contain a great deal of information about how a model is behaving. Attention patterns, in particular, are a promising source of signals for hallucination detection, failure mode diagnosis, task drift detection, uncertainty estimation, and ongoing monitoring of model behavior.
This post introduces glassbox, a vLLM plugin for extracting structured signals from transformer attention during inference.
Why build this?
Raw activations and full attention matrices are expensive to retain, and modern serving engines like vLLM are specifically optimized to never materialize the full L×L attention matrix (FlashAttention, Triton kernels). That’s a good thing for performance, but it means the structure we care about is hidden by design.
Because of that, most tools for inspecting transformer internals live in research harnesses built around HuggingFace models, leaving a gap between what the papers demonstrate and what is practical at inference time.
What is glassbox?
Glassbox is a vLLM plugin that instruments the attention path during inference to extract compact, structured features from attention-related operators. It is designed to be a practical implementation of ideas from the attention analysis literature in a form that works inside a real inference engine.
The project is built around three ideas:
- Research-informed signals
- Built for inference
- vLLM-native
We will expand on each of these in the following sections.
1. Research-informed signals
Glassbox implements state-of-the-art methods for extracting and analyzing attention structure, and extends them with new techniques from active research by the safety team at Red Hat AI.
Today, glassbox extracts features from five attention-derived operators:
Spectral features from the pre-softmax scores matrix S = QKᵀ. The leading singular values of S reveal attention sharpness (is the head focused on one dominant pattern or distributing attention across several?), whether a head is content-adaptive or positional, and how the attention structure evolves over the course of generation.
Attention symmetry features from the raw post-softmax matrix A = softmax(S/√d) (AttentionTracker, arXiv:2411.00348). These capture the coupling between symmetric and antisymmetric parts of the attention matrix — structure connected to recent results on mechanistic classification of failure modes such as prompt injection and hallucination.
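To make the decomposition concrete, here is a toy NumPy sketch. The symmetric/antisymmetric split is the standard one; the coupling ratio below is a simplified stand-in of our own, not AttentionTracker's exact feature set, and `symmetry_features` is an illustrative name.

```python
import numpy as np

def symmetry_features(A):
    """Split an attention matrix into symmetric and antisymmetric parts
    and summarize their relative energy (Frobenius norms).

    The coupling ratio is a simplified illustrative statistic, not the
    exact feature set computed by AttentionTracker or glassbox."""
    A_sym = 0.5 * (A + A.T)    # symmetric part
    A_anti = 0.5 * (A - A.T)   # antisymmetric part
    sym_energy = np.linalg.norm(A_sym)
    anti_energy = np.linalg.norm(A_anti)
    return {
        "sym_energy": sym_energy,
        "anti_energy": anti_energy,
        "coupling": anti_energy / (sym_energy + 1e-12),
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S = rng.normal(size=(8, 8))
    # causal attention: mask strictly-upper entries before softmax
    S = np.where(np.tril(np.ones_like(S)) > 0, S, -np.inf)
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    print(symmetry_features(A))
```

Causal attention is heavily asymmetric by construction, so the interesting signal is in how the balance between the two parts shifts over generation, not in its absolute value.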
Diagonal self-attention features from the diagonal of A (LLM-Check, NeurIPS 2024). The self-attention weight A[i,i] — how much a token attends to itself — correlates with model confidence and factuality. Glassbox extracts these without materializing A by computing only the diagonal entries.
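A minimal NumPy sketch of that idea: stream one query row at a time so the full matrix A is never stored. Glassbox's kernel-level implementation differs; `selfattn_diagonal` is an illustrative name.

```python
import numpy as np

def selfattn_diagonal(Q, K):
    """Compute diag(softmax(Q K^T / sqrt(d))) for causal attention one
    row at a time, so the L x L matrix is never materialized.
    Memory is O(L) per row instead of O(L^2) overall."""
    L, d = Q.shape
    diag = np.empty(L)
    scale = 1.0 / np.sqrt(d)
    for i in range(L):
        s = (Q[i] @ K[: i + 1].T) * scale   # scores against keys j <= i
        s -= s.max()                        # numerical stability
        e = np.exp(s)
        diag[i] = e[i] / e.sum()            # A[i, i]
    return diag
```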
Laplacian eigenvalue features from the in-degree graph Laplacian L = D_in - A (LapEigvals, EMNLP 2025, arXiv:2502.17598). Treating attention as a weighted directed graph, the Laplacian diagonal captures how much attention each token receives. For causal attention, eigenvalues are just the diagonal entries — no decomposition needed.
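The "no decomposition needed" shortcut follows from triangularity: causal A is lower triangular, so L = D_in - A is too, and the eigenvalues of a triangular matrix are its diagonal entries. A small NumPy sketch (function name is ours):

```python
import numpy as np

def laplacian_eigvals_causal(A):
    """For causal (lower-triangular) attention, L = D_in - A is also
    lower triangular, so its eigenvalues are just its diagonal:
    lambda_i = in_degree(i) - A[i, i]. No eigendecomposition needed."""
    d_in = A.sum(axis=0)       # in-degree: attention each token receives
    return d_in - np.diag(A)   # diagonal of L = D_in - A
```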
Routing features from the degree-normalized post-softmax operator M = D_Q^(-1/2) A D_K^(-1/2) (Dahlem et al., upcoming). Degree normalization removes heterogeneity from uneven attention distributions and exposes the effective routing structure of attention. Glassbox extracts spectral features from M as well as features from its Hodge decomposition — separating attention asymmetry into potential-driven (gradient-like) and circulatory (irreversible) components. These routing and flow features are new implementations from ongoing research by the safety team at Red Hat AI.
2. Built for inference
Research tools that analyze attention typically materialize the full L×L matrix — fine for a notebook, but a non-starter at serving time. Glassbox is designed from the ground up with the inference use case in mind.
Matrix-free algorithms. The core idea is that we never need the attention matrix explicitly. We only need to multiply it by vectors:
S·v = Q·(Kᵀ·v) — two thin matrix-vector products, O(Ld)
Sᵀ·u = K·(Qᵀ·u) — same cost
Each multiplication costs O(Ld) through the L×d factors Q and K — not O(L²). These “matvec” primitives are used in iterative SVD algorithms within glassbox.
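Here is how these primitives can feed a rank-k SVD, using standard randomized subspace iteration on the implicit operator S = QKᵀ. This is a reference sketch of the idea, not necessarily glassbox's exact algorithm or parameter choices.

```python
import numpy as np

def matvec_S(Q, K, v):
    """S @ v = Q @ (K^T @ v): two thin products, O(L d) instead of O(L^2)."""
    return Q @ (K.T @ v)

def rmatvec_S(Q, K, u):
    """S^T @ u = K @ (Q^T @ u): same cost."""
    return K @ (Q.T @ u)

def randomized_topk_svals(Q, K, k=4, n_iter=4, seed=0):
    """Leading singular values of S = Q K^T without ever forming S.
    Standard randomized subspace iteration built on the two matvec
    primitives above; a sketch, not glassbox's exact implementation."""
    rng = np.random.default_rng(seed)
    L = Q.shape[0]
    Y = matvec_S(Q, K, rng.normal(size=(L, k)))     # random range sketch
    for _ in range(n_iter):                         # power iterations
        Y, _ = np.linalg.qr(Y)
        Y = matvec_S(Q, K, rmatvec_S(Q, K, Y))      # (S S^T) @ Y
    Qb, _ = np.linalg.qr(Y)
    B = rmatvec_S(Q, K, Qb).T                       # B = Qb^T S, shape (k, L)
    return np.linalg.svd(B, compute_uv=False)       # top-k singular values
```

Only L×d and d×k intermediates ever exist, so the cost per iteration stays linear in sequence length.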
For the post-softmax operator, the problem is harder. Standard FlashAttention and Triton kernels already tile the softmax computation to avoid materializing the full matrix — but they’re fused with the value matrix V from the KV cache. SVD probing requires applying softmax(QKᵀ/√d) to arbitrary vectors, not V. Glassbox provides a fused Triton kernel that replaces V with the probe vectors, using the same online-softmax technique to apply the attention operator to multiple test vectors in a single kernel launch.
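The online-softmax recurrence that kernel relies on looks roughly like this in NumPy: stream over key blocks, keep a running row max and denominator, and rescale partial results as the max updates. A reference sketch of the technique, not the Triton code.

```python
import numpy as np

def attention_apply_probes(Q, K, P, block=16):
    """Apply A = softmax(Q K^T / sqrt(d)) (causal) to probe vectors P
    via the online-softmax recurrence, streaming over key blocks so the
    L x L matrix is never formed. NumPy sketch of what a fused kernel
    does with probes P in place of the value matrix V."""
    L, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((L, P.shape[1]))
    m = np.full(L, -np.inf)   # running row max
    z = np.zeros(L)           # running softmax denominator
    for start in range(0, L, block):
        Kb = K[start:start + block]
        Pb = P[start:start + block]
        s = (Q @ Kb.T) * scale                    # (L, b) score block
        # causal mask: query i sees key j only if j <= i
        j = np.arange(start, start + Kb.shape[0])
        s = np.where(j[None, :] <= np.arange(L)[:, None], s, -np.inf)
        m_new = np.maximum(m, s.max(axis=1))
        alpha = np.exp(m - m_new)                 # rescale earlier state
        p = np.exp(s - m_new[:, None])            # unnormalized block probs
        out = out * alpha[:, None] + p @ Pb
        z = z * alpha + p.sum(axis=1)
        m = m_new
    return out / z[:, None]                       # == A @ P
```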
Configurable overhead. Glassbox provides fine-grained control over the observability-to-latency tradeoff. Each feature group can be independently enabled or disabled. Extraction intervals, monitored layers and heads, SVD rank, and algorithm choice are all configurable:
spectral:
  enabled: true
  interval: 32        # extract every 32 decode steps
  rank: 4
  method: randomized
  heads: [0, 1, 2, 3] # monitor only these heads
routing:
  enabled: true
  interval: 64        # less frequent for heavier features
  rank: 4
  heads: [0]
tracker:
  enabled: false      # off by default
selfattn:
  enabled: true
  interval: 32
  heads: [0]
laplacian:
  enabled: true
  interval: 32
  heads: [0]
output:
  path: glassbox.jsonl
emit:
  otel: true          # emit as OpenTelemetry spans
You choose how much observability you want and how much latency you’re willing to pay for it.
3. vLLM-native
Glassbox works through vLLM’s supported extension points:
Custom attention backend. Glassbox’s SVDTritonAttentionBackend provides access to attention internals.
The implementation class calls the parent forward() first (standard Triton attention runs exactly as before), then accumulates Q tokens and periodically extracts K from the paged KV cache to run the feature extraction pipeline. Everything else — metadata builder, KV cache shape, kernel selection — is inherited unchanged.
We then make the backend available via an entry point in pyproject.toml, from which vLLM loads glassbox automatically:
[project.entry-points."vllm.general_plugins"]
glassbox = "glassbox.vllm_plugin:register_svd_backend"
We are also exploring out-of-tree torch.compile passes and register_forward_hook as additional extraction paths for hidden states and activations — a complementary signal source beyond attention. The latter has independently been proposed in a recent vLLM RFC (vllm-project/vllm#36998), so we are excited to see it land.
Running glassbox
Glassbox supports three run modes depending on the use case:
vllm serve — production inference. Start vLLM with the custom backend and glassbox activates automatically via the plugin entry point. With emit.otel: true in glassbox.yaml, feature snapshots are emitted as OpenTelemetry spans that flow through the same OTel collector vLLM already uses (Jaeger, Tempo, Datadog, etc.) — zero additional infrastructure.
vllm serve facebook/opt-125m --attention-backend CUSTOM --enforce-eager
glassbox-run — single-prompt testing. Quick way to try it out during development:
glassbox-run \
  --model facebook/opt-125m \
  --signal spectral \
  --interval 16 --rank 4 --heads 0 \
  --otel \
  --prompt "The future of artificial intelligence is"
glassbox-extract — offline feature extraction. Runs two-phase extraction on labeled datasets and produces JSONL and Parquet files for downstream training:
glassbox-extract \
  --model Qwen/Qwen2-7B-Instruct \
  --dataset halueval_hallucination \
  --signal spectral,routing,tracker,selfattn,laplacian \
  --parquet
Signal emission
Extracted signals are packaged into “snapshot” objects and flow through a pluggable handler system. Multiple handlers can be active simultaneously:
- JsonlHandler writes every snapshot as a JSON line — for archival, bulk analysis, and training data.
- OtelHandler emits each snapshot as an OpenTelemetry span with glassbox.* attributes — for real-time detection pipelines.
- Custom handlers implement the SnapshotHandler protocol to forward snapshots to any sink: a feature store, Kafka, Redis, a trained classifier, or whatever your detection pipeline needs.
OTel is just one delivery mechanism. The handler system is designed so that downstream consumers — whether a hallucination classifier, a monitoring dashboard, or a feature store — can plug in directly.
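As a sketch of what a custom handler could look like, here is a hypothetical consumer that flags snapshots whose spectral ratio crosses a threshold. The `handle()` method name and the dict-shaped snapshot are assumptions for illustration; the real interface is defined by glassbox's SnapshotHandler protocol.

```python
from typing import Any, Dict, List

class ThresholdAlertHandler:
    """Hypothetical custom handler: records an alert whenever a
    snapshot's sv_ratio exceeds a threshold. Method name and snapshot
    shape are illustrative assumptions, not glassbox's actual API."""

    def __init__(self, threshold: float = 10.0):
        self.threshold = threshold
        self.alerts: List[Dict[str, Any]] = []

    def handle(self, snapshot: Dict[str, Any]) -> None:
        ratio = snapshot.get("features", {}).get("sv_ratio")
        if ratio is not None and ratio > self.threshold:
            self.alerts.append({
                "layer": snapshot.get("layer_idx"),
                "step": snapshot.get("step"),
                "sv_ratio": ratio,
            })
```

In practice a handler like this would forward to a real sink (Kafka, a feature store, a classifier) rather than accumulate in memory.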
Each snapshot is a structured record like this:
{
  "signal": "spectral",
  "request_id": 0,
  "layer": "model.layers.3.self_attn.attn",
  "layer_idx": 3,
  "head": 0,
  "step": 32,
  "L": 38,
  "singular_values": [1582.6, 173.6, 102.4, 89.1],
  "features": {
    "sv1": 1582.6,
    "sv_ratio": 9.12,
    "sv_entropy": 0.63
  }
}
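The derived features are simple functions of the singular values. The sketch below reproduces sv1 and sv_ratio from the snapshot; the entropy definition (Shannon entropy of the normalized spectrum) is our assumption, and glassbox may normalize differently.

```python
import numpy as np

def spectral_features(singular_values):
    """Summary features from leading singular values. sv1 and sv_ratio
    match the snapshot fields; the entropy definition here (Shannon
    entropy of the normalized spectrum) is an illustrative assumption."""
    s = np.asarray(singular_values, dtype=float)
    p = s / s.sum()                               # normalized spectrum
    return {
        "sv1": float(s[0]),                       # leading singular value
        "sv_ratio": float(s[0] / s[1]),           # sigma_1 / sigma_2
        "sv_entropy": float(-(p * np.log(p)).sum()),
    }
```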
A first look at what comes out
Here’s what it looks like in practice. We run glassbox-run on OPT-125m with spectral features enabled:
$ glassbox-run --model facebook/opt-125m --signal spectral \
    --interval 16 --rank 2 --heads 0 \
    --prompt "The future of artificial intelligence is"
As the model generates tokens, glassbox streams feature snapshots for every layer at each configured interval:
INFO Creating vLLM engine with CUSTOM attention backend
INFO Model: facebook/opt-125m
INFO Signals: spectral=enabled routing=disabled tracker=disabled selfattn=disabled laplacian=disabled
INFO Spectral: interval=16 rank=2 method=randomized heads=[0]
INFO Starting generation...
[spectral] layers.0.self_attn head=0 step=32 L=38 sv1=274.6 sv_ratio=1.81 sv_entropy=1.32
[spectral] layers.1.self_attn head=0 step=32 L=38 sv1=1067.9 sv_ratio=14.4 sv_entropy=0.54
[spectral] layers.2.self_attn head=0 step=32 L=38 sv1=508.0 sv_ratio=1.94 sv_entropy=1.22
[spectral] layers.3.self_attn head=0 step=32 L=38 sv1=1582.6 sv_ratio=9.12 sv_entropy=0.63
...
[spectral] layers.11.self_attn head=0 step=32 L=38 sv1=816.3 sv_ratio=3.03 sv_entropy=1.15
[spectral] layers.0.self_attn head=0 step=64 L=70 sv1=570.3 sv_ratio=2.30 sv_entropy=1.29
[spectral] layers.1.self_attn head=0 step=64 L=70 sv1=2004.5 sv_ratio=12.5 sv_entropy=0.59
...
INFO Generated: clouding the world. A report from Canalys found that artificial
intelligence is projected to grow significantly by 20% by 2023 ...
Each line is a feature snapshot for one (layer, head, step) tuple. The same data can be written to JSONL files, emitted as OpenTelemetry spans, or forwarded to any custom handler.
What the features tell you
The σ₁/σ₂ ratio — how sharply attention is concentrated on a single dominant pattern — tells a different story at each layer:
| Layer | step 16 | step 32 | step 48 | step 64 | Behavior |
|---|---|---|---|---|---|
| 1 | 14.1 | 14.4 | 14.1 | 12.5 | Constant — content-independent (positional) |
| 3 | 12.2 | 9.12 | 6.3 | 5.33 | Decaying — attention spreads as context diversifies |
| 7 | 1.22 | 1.22 | 1.59 | 1.32 | Near-isotropic — no single direction dominates |
Layer 1 has a rock-stable ratio of ~14 regardless of what’s being generated — the signature of a fixed structural pattern, almost certainly positional attention (e.g. always attend to the BOS token).
Layer 3 starts very sharp (12×) but steadily decays to 5× as the generated text introduces diverse entities. The dominant Q-K direction can’t capture everything — a second direction gains weight.
Layer 7 stays nearly isotropic (~1.2–1.3×) throughout — multiple attention patterns are active simultaneously, with no single direction dominating.
This works across architectures. We’ve verified it on GPT-2 (124M, standard multi-head attention) and Qwen2-7B-Instruct (7B, grouped-query attention with 28 Q heads / 4 KV heads) with no code changes.
This is a small model on a toy prompt. But the point is not the specific numbers — it’s that different layers have qualitatively different spectral behaviors during generation, and those behaviors are interpretable. That’s the kind of structure downstream systems can learn from.
The vision
Our goal is to make a class of model-internal signals operational inside vLLM — signals that help distinguish normal behavior from anomalous behavior, localize where a model starts to drift, and that downstream systems can learn from, compare, aggregate, and act on.
Extracting these signals is half the problem. The other half is: once you have them, how do you act on them?
Tools like vLLM Semantic Router route requests based on external signals — the query text, domain, urgency. Glassbox provides a complementary set of internal signals — what the model is actually doing during generation. The question we’re exploring is: what is the right mechanism to close the loop?
We’re experimenting with several directions:
- Inline detection. A ClassifierHandler that runs a trained probe on attention features during inference and flags suspicious behavior in real time — via OTel, logging, or a custom handler.
- vLLM-native feedback. The Observation Plugin RFC (vllm-project/vllm#36998) proposes ABORT/CONTINUE actions based on model internals. We’re prototyping an integration that bridges glassbox verdicts into this mechanism, allowing the engine to halt a request when hallucination is detected from attention features.
- External serving. A REST endpoint that exposes trained classifiers to external routing systems — semantic router, llm-d, or custom orchestrators.
- Shadow deployment. Running glassbox via llm-d in shadow mode alongside the serving path, separating the observability workload from inference entirely for cases where zero inline overhead is required.
We don’t yet know which pattern fits best, and the answer may be different for different use cases. We’re figuring it out.
Where we are going next
In future posts, we plan to go deeper into:
- Feature evaluation. How do the derived features actually correlate with failure modes? We plan to run across labeled datasets like HaluEval and TruthfulQA and see what signal is there — even if the answer is mixed.
- Overhead and benchmarks. How much does this cost? What does the observability-to-latency tradeoff look like in practice?
- Closing the loop. From signal extraction to action — training detection models on glassbox features, running them inline, and integrating with vLLM’s request lifecycle.
- Design details. The matrix-free extraction path, the fused Triton kernel, and the engineering of the vLLM integration in more depth.