Need to Know
ABAC-gated RAG for an internal knowledge fabric
A platform-engineering org had a knowledge problem hiding inside an access problem. Runbooks, postmortems, internal API specs, contracts, board materials, and comp bands made up one sprawling corpus spanning six functions and four sensitivity tiers. They tried two SaaS RAG tools; procurement killed both because the corpus couldn't leave their cloud. Engineers were running a six-window grep across Confluence, GitHub, and the VDR every time they needed to remember why a service was decommissioned in 2023. Studio built bespoke: self-hosted hybrid retrieval (dense plus BM25 plus reranker), every chunk tier-tagged at ingest, every query carrying the asker's tier and attributes, every answer carrying only citations the asker is authorized to see. The audit log is immutable (WORM). Across 14 months: zero RBAC leaks in audit findings, 87% query satisfaction, and 40 hours per quarter of compliance review time recovered.
Ingest-time tier tagging on every chunk
Tiers are not assigned at query time; they are baked into the chunk record the moment it enters the index. Apache Tika and Unstructured.io normalize across PDFs, Confluence pages, GitHub markdown, and VDR exports; Presidio scans for PII and applies redactions with named-entity recovery so phrasing stays natural; the tier tag (T1 to T4) is written alongside the embedding. The tag travels with the chunk for its lifetime in the store.
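As a rough illustration of a tier tag that is fixed at ingest, the sketch below models a chunk record as an immutable value. The names `Tier`, `ChunkRecord`, and `build_chunk` are hypothetical, and the ordering (T1 least sensitive, T4 most) is an assumption, not something the deployment states.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Sensitivity tiers. Assumption: T1 is least sensitive, T4 most."""
    T1 = 1
    T2 = 2
    T3 = 3
    T4 = 4


@dataclass(frozen=True)
class ChunkRecord:
    """One indexed chunk; frozen so the tier tag cannot be changed in place."""
    chunk_id: str
    source: str        # e.g. "confluence", "github", "vdr"
    text: str          # chunk text after PII redaction
    tier: Tier         # written at ingest, alongside the embedding
    redacted: bool     # whether redaction touched this chunk
    embedding: tuple   # dense vector stored next to the tag


def build_chunk(chunk_id, source, text, tier, embedding, redacted=False):
    """Build the record written to the index; refuses untagged chunks."""
    if not isinstance(tier, Tier):
        raise ValueError("every chunk must carry a tier tag at ingest")
    return ChunkRecord(chunk_id, source, text, tier, redacted, tuple(embedding))
```

Freezing the record is one way to express that a tag changes only through re-ingest, never through an in-place update.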
Hybrid retrieval with ACL filter at gate 1
Every query carries the asker's JWT-bound tier and attribute claims through the Kong gateway. Before either retriever sees a vector, the ACL filter narrows the candidate set to chunks the asker is allowed to consider. Dense retrieval (Qdrant) and lexical retrieval (OpenSearch BM25) both fire against the filtered set; the cross-encoder reranker (BGE) scores the merged candidates against the original query.
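A minimal sketch of the gate 1 idea, in plain Python rather than against Qdrant or OpenSearch: filter first, retrieve second, then merge for the reranker. The `Claims` and `Chunk` shapes, and the rule that a chunk is visible when its tier does not exceed the asker's tier and its function is among the asker's functions, are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claims:
    """Attribute claims taken from the validated JWT (hypothetical shape)."""
    tier: int
    functions: frozenset


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    tier: int
    function: str


def gate1_filter(chunks, claims):
    """Narrow the candidate set before any retriever scores anything."""
    return [
        c for c in chunks
        if c.tier <= claims.tier and c.function in claims.functions
    ]


def merge_candidates(dense_hits, lexical_hits):
    """Union of both retrievers' hits, de-duplicated, first-seen order kept."""
    seen, merged = set(), []
    for chunk in list(dense_hits) + list(lexical_hits):
        if chunk.chunk_id not in seen:
            seen.add(chunk.chunk_id)
            merged.append(chunk)
    return merged
```

In a real store the same predicate would more likely be pushed down as a query-time filter so that disallowed chunks are never scored at all; the merged list is what a cross-encoder reranker would then score against the query.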
Post-retrieval validation at gate 2
The JWT tier is a 15-minute snapshot. ACLs can change inside that window: a revoke, a role-down, a project rotation. Gate 2 re-fetches the asker's current ACLs (Postgres source of truth, 60-second Redis cache) and drops any candidate that no longer passes. The JWT TTL is the safety net; the fresh lookup is the source of truth.
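The re-check can be sketched as a short-TTL cache in front of a loader function. Here an in-memory dict stands in for Redis and a callable stands in for the Postgres query; `CachedAclLookup`, `gate2_validate`, and the `(chunk_id, tier)` candidate shape are hypothetical names chosen for the example.

```python
import time


class CachedAclLookup:
    """Current tier per user, cached for ttl_seconds (stand-in for Redis)."""

    def __init__(self, load_from_source, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_source   # stand-in for the Postgres lookup
        self._ttl = ttl_seconds
        self._clock = clock             # injectable so tests can control time
        self._entries = {}              # user_id -> (fetched_at, tier)

    def current_tier(self, user_id):
        now = self._clock()
        hit = self._entries.get(user_id)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        tier = self._load(user_id)
        self._entries[user_id] = (now, tier)
        return tier


def gate2_validate(candidates, user_id, lookup):
    """Drop candidates the asker can no longer see.

    candidates: list of (chunk_id, chunk_tier) tuples from retrieval.
    """
    tier_now = lookup.current_tier(user_id)
    return [c for c in candidates if c[1] <= tier_now]
```

With a 60-second cache, a revoke takes effect within about a minute instead of waiting out the 15-minute token lifetime.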
Post-LLM citation validation at gate 3
Once the LLM (Claude Sonnet 4.6, with a model-portable runtime and a Groq fallback path) produces an answer, gate 3 walks every emitted citation and confirms it sits in the asker's allow-list. If the model hallucinates a citation outside tier or fabricates a reference, the response is withheld and a refusal is returned with the trace id. Three independent gates, no shared trust.
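A sketch of the gate 3 check, assuming citations are plain chunk ids and the allow-list is the set of ids that survived gates 1 and 2 for this request. The function name and the response shape are hypothetical.

```python
def gate3_validate(answer, citations, allow_list, trace_id):
    """Withhold the answer if any cited id is outside the asker's allow-list.

    A fabricated reference fails the same check, because an id the system
    never retrieved cannot be in the allow-list.
    """
    allowed = set(allow_list)
    outside = [c for c in citations if c not in allowed]
    if outside:
        # The refusal carries only the trace id, not the offending ids,
        # so the response itself does not leak what was blocked.
        return {"status": "refused", "trace_id": trace_id, "answer": None}
    return {
        "status": "ok",
        "trace_id": trace_id,
        "answer": answer,
        "citations": list(citations),
    }
```

For example, `gate3_validate("text", ["RB-1", "X-9"], {"RB-1"}, "t-42")` returns a refusal because `X-9` is not in the allow-list.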
Immutable WORM audit ledger
Every request, allowed or refused, writes a signed entry to S3 Object Lock in compliance mode with a 7-year retention. Hash-chained for tamper evidence. In compliance mode even the AWS root account cannot delete an object during retention. Auditors export directly; the GRC team stopped reconstructing trails by hand.
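Hash chaining can be illustrated without S3 at all. The sketch below keeps the ledger as an in-memory list and links each entry to the previous entry's SHA-256 digest; entry signing and the Object Lock write are left out, and the field names are assumptions made for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash, payload):
    """SHA-256 over the previous hash plus a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(ledger, payload):
    """Append a payload, chaining it to the last entry's hash."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    entry = {
        "prev_hash": prev_hash,
        "payload": payload,
        "hash": entry_hash(prev_hash, payload),
    }
    ledger.append(entry)
    return entry


def verify_chain(ledger):
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

The chain gives tamper evidence; the retention lock is what prevents deletion. The two properties are complementary rather than interchangeable.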
Three lanes. The ingest lane (top) parses, scans for PII, tags tier, embeds, writes to both stores. The query lane (middle) runs the request through Kong, JWT validation, fresh ACL resolve, hybrid retrieval, reranker, LLM, and guardrails. The audit lane (bottom) accepts a write from every stage that touches the request. Three padlocks mark the three-stage ACL gating: pre-retrieval filter, post-retrieval validation, post-LLM citation validation. The gates are the load-bearing idea; everything else is plumbing.
Stale-permission window: ACLs change inside a 15-minute JWT TTL; a revoke or role-down can land mid-session.
Fresh ACL lookup on every request against the Postgres source of truth, fronted by a 60-second Redis cache to stay within the latency budget. JWT TTL stays as the failure-mode fallback. The fresh lookup is the gate; the JWT is the floor.
Source-system identity reconciliation: Confluence groups, GitHub teams, VDR rooms, and HRIS roles all model permissions differently.
Ingest-time mapping to a single tier model with provenance kept per source. Conflicts are surfaced for human review at ingest, never at query, so the runtime gate sees a clean tag. Re-ingest reruns the mapping when source ACLs shift.
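One way to picture conflict surfacing at ingest: each source system proposes a tier for a document, agreement yields a clean tag, and disagreement holds the document for review. The function name, the input shape, and the choice to assign no tier on conflict are assumptions made for illustration.

```python
def reconcile_tier(proposals):
    """Map per-source tier proposals to one tier, or flag a conflict.

    proposals: dict of source system name -> tier that source's ACLs imply,
    e.g. {"confluence": 2, "github": 2}.
    """
    if not proposals:
        raise ValueError("no source provided a tier")
    provenance = dict(proposals)  # kept per source, as recorded at ingest
    tiers = set(proposals.values())
    if len(tiers) == 1:
        return {"tier": tiers.pop(), "needs_review": False, "provenance": provenance}
    # Sources disagree: assign nothing and hold the document for a human,
    # so no ambiguous tag ever reaches the query-time gates.
    return {"tier": None, "needs_review": True, "provenance": provenance}
```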
PII inside legitimately-accessible docs (a runbook that pastes a customer ID, an incident transcript with employee names).
Ingest-time Presidio scan with redaction and named-entity recovery so the surrounding phrasing stays natural, with no jagged [REDACTED] in the middle of a sentence. Redaction state travels on the chunk and renders as a badge on the citation.
Auditor trust required immutability that even root could not break; soft-delete and IAM policies were not enough.
S3 Object Lock in compliance mode with 7-year retention. In compliance mode (as distinct from governance mode) no principal, including the root account, can delete or shorten retention during the lock period. The auditor signed off on this in week one.
Refusal correctness vs over-refusal: a model that refuses everything is technically correct and operationally useless.
Budget of 5% or fewer over-refusals, tracked weekly. Refusals are sampled by the GRC team and replayed; tier mappings or prompt-side scoping is adjusted. None missed in audit across 14 months; the over-refusal rate has stayed under budget for 11 of those months.
Auth + identity
- OIDC via Okta (Azure AD path supported)
- JWT RS256 with 15-min TTL + rotating refresh
- Attribute claims: tier, function, geo, on-call
Retrieval
- Qdrant, self-hosted (dense vectors)
- OpenSearch, self-hosted (BM25 lexical)
- BGE-large embedder + BGE cross-encoder reranker
LLM
- Claude Sonnet 4.6 (primary)
- Model-portable runtime (swap without rewrite)
- Groq fallback path for cost + capacity
Ingest
- Apache Tika + Unstructured.io (multi-format)
- Microsoft Presidio (PII scan + redaction)
- Tier tag stamped per chunk at write
Audit
- S3 Object Lock, compliance mode (WORM)
- Hash-chained ledger, 7-year retention
- Direct auditor export, no intermediate format
Gateway + ops
- Kong (gateway, rate, JWT validate)
- FastAPI orchestrator, OpenTelemetry traces
- Single-tenant deploy, customer-controlled cloud
Persona-Gated RAG
Four personas across the four sensitivity tiers, five sample queries against the corpus. Pick a persona, pick a query, submit. The response panel renders the answer the system would have produced for that asker, with the citations they are authorized to see, or a refusal stamped with the tier requirement that blocked it. The audit-log block shows what was written to the WORM ledger for the request. The same query against different personas swings between authorized and refused, which is the entire point.
Cascade failure on 2024-08-14: the public summary attributes it to a config rollout in the edge proxy that bypassed the staged-rollout gate. Mitigation was a full rollback in 22 minutes. The exec summary you can see does not include customer names or the redacted incident-channel transcript.
- T1 · POST-2024-0814-exec · Outage 2024-08-14 · executive summary
- T1 · RB-edge-rollout-v3 · Edge config rollout runbook v3
- Zero RBAC leaks across 14 months of audit findings
- P95 query latency 4.2s end-to-end including three ACL gates
- 87% query satisfaction (4-or-5-star) across six stakeholder functions
- Compliance review time recovered, ~40 hours per quarter
Deployed on [REDACTED] platform-engineering org's customer-controlled cloud, single tenant. Six stakeholder functions onboarded across the first three quarters; new functions ship through the same ingest pipeline. SOC 2 Type II audited.
Sovereign deployment pattern available for regulated tenants. The audit ledger has been exported directly by external auditors across two full cycles.