Why we built an AI-native e-discovery tool
The existing tools were built for a world where "intelligent" meant keyword search and Boolean operators. We thought we could do better.
There's a version of this essay that starts with a statistic about how much American corporations spend on e-discovery every year. We're not going to write that version. You've seen the number. It's large. It didn't change your behavior. Numbers don't change behavior in legal — case outcomes do.
So instead, let's start with a case.
The case that broke our assumptions
A mid-market litigation boutique had a client in a commercial dispute — a construction contract gone wrong, two parties, three years of emails, about 400,000 documents in total. The firm had two associates, a deadline, and a review vendor quote that was going to cost their client more than the fee agreement allowed.
They were good lawyers. They knew that the document that would win the case existed. Somewhere in those 400,000 files was an internal email from the opposing party's project manager, written six weeks before the dispute escalated, that established exactly what the company knew and when it knew it. The kind of document that, once found, makes the case.
They never found it in time. They settled.
The problem wasn't effort or competence. The problem was a fundamentally broken tool stack. The review platform they were using, from one of the three or four major vendors that dominate the market, was built in 2008 and had received only cosmetic updates since. It did keyword search. It did Boolean operators. With the right add-on module, it would do "predictive coding," which in practice meant a supervised classification model that required thousands of human-reviewed training documents before it would do anything useful.
None of these tools could answer the only question that mattered: what did they know, and when did they know it? That question requires understanding relationships between entities across documents. It requires tracing a communication chain across custodians and dates. It requires reasoning, not matching.
The tool stack we wished existed
We didn't start by trying to build an e-discovery tool. We started by trying to answer a simpler question: what would a legal research assistant look like if you built it from scratch, today, with modern AI infrastructure?
The answer, we discovered quickly, required solving three distinct problems that the existing market hadn't solved in combination.
The ingestion problem. Legal documents come in every format imaginable: scanned PDFs with broken OCR, native Office files, audio recordings from depositions, email chains with nested threads and calendar attachments, privilege logs in Excel, Bates-stamped productions in a dozen different formats. Any serious e-discovery tool has to ingest all of it, reliably, without losing the structural information that makes documents meaningful. Most tools ingest text. They lose structure. They lose relationships between documents that aren't explicit in the text.
The retrieval problem. Once you have the documents, you have to find the ones that matter. Keyword search fails because it finds words, not meaning. A document about "the project manager's awareness of the defect" won't surface if you search for "knew about the problem." Semantic search — vector embeddings — is better, but it answers questions about what a single document says, not what a corpus of documents implies. The question "what did they know and when did they know it" requires reasoning across dozens of documents, not just finding the most semantically similar one.
The defensibility problem. Legal work product has to hold up in court. That means every AI-assisted output needs to be traceable: this answer came from these documents, on these pages, in this production set. The citation chain has to be auditable. If the AI is wrong — and it will sometimes be wrong — the attorney has to be able to find the error. The tool can't be a black box.
No tool in the market was solving all three. Some were solving one. None were solving them as a unified system designed from the ground up for legal work.
What we built
Ananse is our answer to those three problems.
The ingestion layer handles every document type we've encountered in production: PDFs (scanned and native), Word, PowerPoint, Excel, email in MSG and EML formats, audio and video files, images. When a document enters the system, it doesn't just get converted to text — it gets structured. Bates numbers are identified or assigned. Custodians are extracted and linked. Privilege flags are surfaced based on configurable criteria: attorney-to-client communications, work product markers, common in-house counsel indicators.
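To make "structured, not just converted to text" concrete, here is a minimal sketch of an ingestion record in Python. It is an illustration under stated assumptions, not Ananse's actual code: the Bates pattern, the `IngestedDocument` fields, the `BatesAssigner` class, and the `ingest` function are all hypothetical, and real productions use far more Bates formats than one regular expression can cover.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed Bates format for illustration only: an uppercase prefix followed
# by a 6-8 digit number, e.g. "ACME0001234".
BATES_RE = re.compile(r"\b([A-Z]{2,8})[-_ ]?(\d{6,8})\b")


@dataclass
class IngestedDocument:
    doc_id: str
    text: str
    custodian: str
    bates: Optional[str] = None


class BatesAssigner:
    """Hands out sequential Bates numbers for documents that carry none."""

    def __init__(self, prefix: str, start: int = 1, width: int = 7):
        self.prefix = prefix
        self.next_num = start
        self.width = width

    def assign(self) -> str:
        number = f"{self.prefix}{self.next_num:0{self.width}d}"
        self.next_num += 1
        return number


def ingest(doc_id: str, text: str, custodian: str,
           assigner: BatesAssigner) -> IngestedDocument:
    # Identify an existing Bates stamp in the text, or assign a new one.
    match = BATES_RE.search(text)
    bates = f"{match.group(1)}{match.group(2)}" if match else assigner.assign()
    # Normalize the custodian name so the same person links across documents.
    normalized = " ".join(custodian.split()).lower()
    return IngestedDocument(doc_id, text, normalized, bates)
```

The point of the sketch is that the Bates number and custodian become fields on the record rather than strings buried in the text, which is what lets later stages link and cite them.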
The retrieval layer is where we departed most from the existing market. We use a dual-database architecture: a vector store (Qdrant) for semantic similarity search, and a knowledge graph (Neo4j) for entity relationship traversal. When you ask Ananse a question, it doesn't stop at the most similar documents. It finds those documents, traces the relationships between the entities they mention, and follows the graph to surface documents that are connected by meaning even when they're not connected by words.
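The two-stage idea (semantic seeds first, then graph expansion) can be sketched in plain Python. This is a toy under loud assumptions: in-memory dictionaries stand in for the vector store and the knowledge graph, no Qdrant or Neo4j calls are shown, and the simple hop-by-hop expansion is a swapped-in simplification rather than the production ranking algorithm. The `retrieve` function and all document and entity names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, doc_vecs, doc_entities, entity_links, top_k=1, hops=1):
    # Stage 1 (stand-in for the vector store): rank documents by similarity.
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]),
                    reverse=True)
    seeds = ranked[:top_k]

    # Stage 2 (stand-in for the knowledge graph): collect the entities the
    # seed documents mention, then follow entity-to-entity links.
    entities = set()
    for doc in seeds:
        entities |= doc_entities[doc]
    frontier = set(entities)
    for _ in range(hops):
        frontier = {n for e in frontier for n in entity_links.get(e, ())} - entities
        entities |= frontier

    # Stage 3: surface non-seed documents that mention any reached entity.
    related = sorted(d for d in doc_entities
                     if d not in seeds and doc_entities[d] & entities)
    return seeds, related


# Toy corpus: memo_2 is not similar to the query, but it mentions an entity
# linked to the project manager named in email_1.
doc_vecs = {"email_1": [0.9, 0.1], "memo_2": [0.1, 0.9], "invoice_3": [0.0, 1.0]}
doc_entities = {"email_1": {"pm_smith"}, "memo_2": {"defect_report_17"},
                "invoice_3": {"vendor_x"}}
entity_links = {"pm_smith": {"defect_report_17"}}

seeds, related = retrieve([1.0, 0.0], doc_vecs, doc_entities, entity_links)
print(seeds, related)
```

In this toy corpus, a similarity-only search would stop at `email_1`. The graph step is what pulls in `memo_2`, which shares no vocabulary with the query.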
The approach is based on HippoRAG 2, a retrieval framework named for the hippocampal memory indexing that inspired it. The human hippocampus doesn't just store facts; it indexes relationships between facts. When you remember an event, you recall not just the event but the context around it: who was there, what happened before, what happened after. We wanted retrieval to work the same way.
The defensibility layer is built into every response. Every answer Ananse generates comes with citations: the document, the page, the Bates number, the custodian. The attorney can click through to the source. If the AI made an error in its synthesis, the citation trail shows exactly where. The AI is a research assistant, not an oracle.
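One way to picture a citation trail is as data attached to every answer. The sketch below is an assumption-laden illustration, not the product's schema: the `Citation` and `Answer` classes, the `unresolved` check, and the `render` format are hypothetical, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Citation:
    document_id: str
    page: int
    bates: str
    custodian: str


@dataclass
class Answer:
    text: str
    citations: List[Citation]


def unresolved(answer: Answer, production_bates: Set[str]) -> List[Citation]:
    """Return citations whose Bates number is not in the production set."""
    return [c for c in answer.citations if c.bates not in production_bates]


def render(answer: Answer) -> str:
    """Format the answer text followed by one numbered line per citation."""
    lines = [answer.text]
    for i, c in enumerate(answer.citations, 1):
        lines.append(f"[{i}] {c.bates}, p. {c.page} (custodian: {c.custodian})")
    return "\n".join(lines)


# Made-up example values, for illustration only.
example = Answer(
    text="Example synthesized answer.",
    citations=[Citation("doc-17", 3, "ACME0001234", "jane doe")],
)
print(render(example))
```

A check like `unresolved` is the kind of guard that keeps the tool from being a black box: an answer that cites something outside the production set can be flagged before an attorney relies on it.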
What we're still building
We shipped to our first customers earlier this year. The feedback has been what we expected: e-discovery workflows are deeply idiosyncratic. Every firm has its own privilege criteria, its own production format requirements, its own conventions for Bates numbering and custodian codes.
We're adding support for customizable privilege detection rules — not just the system defaults, but firm-specific keyword lists, attorney roster lookups, and domain-based privilege flags. We're improving the knowledge graph schema to support more complex relationship types: the distinction between a communication that establishes awareness and one that merely reflects it matters legally, and we want the graph to represent that distinction.
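The three kinds of firm-specific rules named above (keyword lists, attorney roster lookups, domain-based flags) could be expressed as plain configuration. This is a hedged sketch, not the shipped feature: `PrivilegeRules`, `privilege_flags`, the flag names, and the example addresses are all hypothetical, and a flag here only means "route to an attorney for a privilege call," not "privileged."

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class PrivilegeRules:
    attorney_emails: Set[str] = field(default_factory=set)     # firm roster
    privileged_domains: Set[str] = field(default_factory=set)  # e.g. outside counsel
    keywords: List[str] = field(default_factory=list)          # firm keyword list


def privilege_flags(rules: PrivilegeRules, sender: str,
                    recipients: List[str], body: str) -> List[str]:
    """Return the names of the rules a message trips, in a fixed order."""
    flags = []
    parties = [addr.strip().lower() for addr in [sender, *recipients]]

    # Attorney roster lookup: any party is a known attorney address.
    if any(addr in rules.attorney_emails for addr in parties):
        flags.append("attorney_roster")

    # Domain-based flag: any party's email domain is on the list.
    if any(addr.rsplit("@", 1)[-1] in rules.privileged_domains for addr in parties):
        flags.append("privileged_domain")

    # Keyword list: case-insensitive substring match on the body.
    lowered = body.lower()
    if any(keyword.lower() in lowered for keyword in rules.keywords):
        flags.append("keyword")

    return flags


rules = PrivilegeRules(
    attorney_emails={"gc@acme.example"},
    privileged_domains={"outsidecounsel.example"},
    keywords=["attorney-client privileged"],
)
print(privilege_flags(rules, "pm@acme.example", ["GC@acme.example"], "Status update"))
```

Keeping the rules in a data object rather than in code is the design choice that matters here: each firm can supply its own roster and lists without touching the matching logic.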
We're also thinking hard about the audit trail problem. As AI-assisted review becomes more common, courts are starting to ask questions about the review process itself: how was the training data selected, who reviewed the AI's outputs, what was the error rate. We think the tools that will win in this market are the ones that make the AI's work product fully auditable — not just the output, but the process.
Why now
The large e-discovery vendors have an innovator's dilemma. They've built businesses on per-page and per-user pricing models that are structurally incompatible with AI assistance. The better the AI gets at doing what human reviewers used to do, the worse their economics get. They will improve their tools eventually, but slowly, and without breaking the pricing model that funds their current businesses.
We don't have that constraint. We built for AI from the start. The pricing model, the architecture, the workflow assumptions — all of it assumes that the machine is doing the mechanical work and the attorney is doing the legal work.
That's not a prediction about the future of legal AI. It's a description of what the tools should have done ten years ago, and what we're doing now.
If you're working a hard case and want to see what this looks like in practice, book a demo. We'll show you on your documents.
Want to see Ananse on your documents?
A 30-minute call. Your documents. No slide deck.