Document Parsers

Synopsis

This tutorial describes how PDF parsing, built on the lopdf crate, can be combined with data pipelines and agentic AI. The examples include a pdf executable that performs PDF parsing.

Tutorial

PDF parsers

Text is extracted from PDF documents and returned as columnar tables that follow the schema used for document processing in the phymes-agents session plans. Simple cleaning of the extracted text is applied by default, e.g., removing extra whitespace and joining lines and paragraphs. The sequence of the text is maintained, but the document hierarchy is lost. Recovering the hierarchy would require a proper OCR solution such as the NVIDIA NIMS nemoretriever-parse or PaddleOCR, which would also provide image and table parsing. In the future, such solutions could be hosted as microservices and called via API.
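The default cleaning described above can be illustrated with a minimal, standard-library-only sketch. It is not the crate's actual implementation: the function names `normalize_whitespace` and `clean_extracted_text` are hypothetical, and it assumes that a blank line marks a paragraph boundary in the raw extracted text.

```rust
/// Collapse runs of spaces, tabs, and other whitespace into single spaces
/// and trim both ends of the line.
fn normalize_whitespace(line: &str) -> String {
    line.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Join the lines of each paragraph into one string, preserving order.
/// Assumption: a blank (or whitespace-only) line separates paragraphs.
fn clean_extracted_text(raw: &str) -> Vec<String> {
    let mut paragraphs = Vec::new();
    let mut current: Vec<String> = Vec::new();

    for line in raw.lines() {
        let line = normalize_whitespace(line);
        if line.is_empty() {
            // Paragraph boundary: flush the lines collected so far.
            if !current.is_empty() {
                paragraphs.push(current.join(" "));
                current.clear();
            }
        } else {
            current.push(line);
        }
    }
    // Flush the final paragraph if the text did not end with a blank line.
    if !current.is_empty() {
        paragraphs.push(current.join(" "));
    }
    paragraphs
}

fn main() {
    let raw = "Text   is extracted\nfrom PDF  documents.\n\n  A second\tparagraph. ";
    for (i, paragraph) in clean_extracted_text(raw).iter().enumerate() {
        println!("{i}: {paragraph}");
    }
}
```

The sketch keeps the order of the text but, like the cleaning described above, discards any layout information, so headings and body text are not distinguished.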

WASM compatibility

Text extraction from PDF documents is supported in WASM, with simd128 vectorization used for acceleration when the CPU supports it.
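In Rust, simd128 is a compile-time target feature for wasm32 targets, so one way to confirm that a build has it enabled is a `cfg!` check. The following is a small sketch under that assumption; the helper name `built_with_wasm_simd128` is hypothetical, and the build command in the comment is a generic example rather than this project's documented build procedure.

```rust
/// Reports whether this binary was compiled for WebAssembly with the
/// `simd128` target feature enabled. Evaluated at compile time.
fn built_with_wasm_simd128() -> bool {
    cfg!(all(target_arch = "wasm32", target_feature = "simd128"))
}

fn main() {
    // The feature can be enabled at build time, for example:
    //   RUSTFLAGS="-C target-feature=+simd128" cargo build --target wasm32-unknown-unknown
    println!("wasm simd128 enabled: {}", built_with_wasm_simd128());
}
```

On a native (non-WASM) build this check reports `false`, which makes it usable as a quick sanity test of the build configuration.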