Ask HN: Who is doing the best Word/PDF RAG tool with deep research?

Hi HN, which SaaS providers are you eyeing up these days for your RAG needs with thousands of PDFs or Word docs and with an agent that can take its time and give well-researched, cited answers? TIA!

4 points | by _samjarman 4 days ago

4 comments

lisa_coicadan 1 day ago
Great thread, we’ve seen the exact same pain points around working with large volumes of complex PDFs/Word docs.
At Retab.com, we focus on the “hard pre-RAG” layer: turning raw documents (including scanned reports, OCR messes, financial statements, and regulatory filings) into clean, structured, model-ready data.
Instead of relying on embeddings over noisy text chunks, we use schema-driven generation, multi-LLM consensus, and an evaluation UI to ensure output is accurate, complete, and explainable. No manual parsing, no hallucinations, just structured JSON (or any format you want), ready for retrieval, agents, or analytics.
We work with teams doing RAG on contracts, audits, earnings reports, etc.: anywhere that “close enough” isn’t good enough. Happy to run your hardest docs through Retab if you want to benchmark against WFGY or LlamaParse.
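For readers unfamiliar with the idea, here is a rough sketch of what field-level consensus across several model outputs could look like. This is not Retab's implementation; the function name `field_consensus`, the `min_agreement` threshold, and the sample invoice fields are all hypothetical, and it assumes each model has already returned a flat dict of hashable scalar values for the same schema.

```python
from collections import Counter
from typing import Any


def field_consensus(
    candidates: list[dict[str, Any]], min_agreement: float = 0.5
) -> tuple[dict[str, Any], list[str]]:
    """Merge several extraction attempts field by field using majority vote.

    A field is accepted only if strictly more than `min_agreement` of the
    candidates returned the same value; otherwise it is set to None and
    reported as disputed so a human (or another pass) can review it.
    """
    merged: dict[str, Any] = {}
    disputed: list[str] = []
    # Visit every field that appears in at least one candidate.
    for key in sorted({k for c in candidates for k in c}):
        # A missing field counts as a vote for None.
        votes = Counter(c.get(key) for c in candidates)
        value, count = votes.most_common(1)[0]
        if count / len(candidates) > min_agreement:
            merged[key] = value
        else:
            merged[key] = None
            disputed.append(key)
    return merged, disputed


# Hypothetical outputs from three different models for the same document.
candidates = [
    {"invoice_number": "INV-001", "total": 1200.50, "currency": "USD"},
    {"invoice_number": "INV-001", "total": 1200.50, "currency": "EUR"},
    {"invoice_number": "INV-001", "total": 1280.50, "currency": "GBP"},
]
merged, disputed = field_consensus(candidates)
print(merged)
print(disputed)
```

The point of voting per field rather than per document is that one model misreading a single number does not throw away the fields everyone agrees on, and the disputed list gives you something concrete to send to review.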
TXTOS 4 days ago
I've been working on something that directly targets this problem: WFGY — a reasoning engine built for RAG on large-scale PDF/Word documents, especially when you're doing deep research, not just shallow QA.
Instead of just chunking text and throwing it into an embedding model, WFGY builds a persistent semantic resonance layer — meaning it tracks context through formatting breaks, footnotes, diagram captions, even corrupted OCR sections.
The engine applies multiple self-correcting pathways (we call them BBMC and BBPF) so even when parsing is incomplete or wrong, reasoning still holds. That’s crucial if your source materials are academic papers, messy reports, or 1000+ page archives.
It’s open source. No tuning. Works with any LLM. No tricks.
Backed by the creator of tesseract.js (36k GitHub stars), who gets why document mess is the real challenge.
Check it out: https://github.com/onestardao/WFGY
randomname4325 4 days ago
Check out www.Airwave.us. They are focused on field services, where techs comb through thousands of pages of manuals and documentation for part numbers or specific instructions that have to be 100% accurate.
Norcim133 4 days ago
I spent the last 2 months trying out RAG/parsing plays. My use-case required high accuracy on complex tables and figures.
Ranking:
1. LlamaCloud/LlamaParse
2. GroundX
3. Unstructured.io
4. Google RAG Engine
5. Docling
... capability gap ...
6. Azure - Document Intelligence
7. AWS - Textract
8. LlamaIndex (DIY)
- Imanari 4 days ago
  This ranking is just for the parsing, not the RAG portion, correct?
  - Norcim133 3 days ago
    Correct-ish. LlamaCloud and GroundX do everything up to retrieval. Here is an interactive graphic of the major players along the RAG flow: https://claude.ai/public/artifacts/b872435b-1d9c-461e-a29c-b...