Docling — Smart Document Converter

Docling is an open-source Python library, hosted by the LF AI & Data Foundation and originally developed at IBM Research, that converts documents into clean, structured formats suited to generative AI workflows such as RAG, fine-tuning, and agents. It is strongest on complex PDFs, whether scanned or digital, and also handles many other formats.

  • Input: PDF (scanned/digital), DOCX, PPTX, XLSX, HTML, images, audio (ASR)...
  • AI-powered: Layout (DocLayNet), Tables (TableFormer), Reading order, OCR, Figures, Formulas, Code blocks...
  • Output: clean Markdown, rich JSON, HTML
  • Unified DoclingDocument model
  • CPU-friendly, GPU acceleration optional
  • Integrates with LangChain, LlamaIndex, Haystack, CrewAI...
  • MIT license — free for commercial use

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869.pdf"
converter = DocumentConverter()
result = converter.convert(source)
doc = result.document

print(doc.export_to_markdown())  # Markdown text, suitable as LLM/RAG input
print(doc.export_to_dict())      # Python dict that can be serialized to JSON
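
Printing is fine for a quick look, but in practice the exports are usually written to disk. The following is a minimal sketch of that step, not part of Docling itself: save_exports and Exportable are hypothetical names introduced here, and the sketch assumes only that the document object offers the two export methods used above and that the returned dict is JSON-serializable.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Protocol


class Exportable(Protocol):
    """Hypothetical protocol: any object with the two export methods used above."""

    def export_to_markdown(self) -> str: ...

    def export_to_dict(self) -> dict[str, Any]: ...


def save_exports(
    doc: Exportable, out_dir: str | Path, stem: str = "document"
) -> tuple[Path, Path]:
    """Write the Markdown and JSON exports of `doc` into `out_dir`.

    Returns the paths of the two files that were written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing

    md_path = out / f"{stem}.md"
    json_path = out / f"{stem}.json"

    # Markdown export: plain text, ready to feed into an LLM/RAG pipeline.
    md_path.write_text(doc.export_to_markdown(), encoding="utf-8")

    # Dict export: assumed JSON-serializable; ensure_ascii=False keeps
    # non-ASCII characters readable in the output file.
    json_path.write_text(
        json.dumps(doc.export_to_dict(), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return md_path, json_path
```

With the conversion above, calling save_exports(result.document, "out", stem="2408.09869") would be expected to write out/2408.09869.md and out/2408.09869.json.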