PDF Processing Guide
Process PDF files using Python libraries and command-line tools to perform operations such as reading, extracting text and tables, merging, splitting, rotating pages, adding watermarks, creating new PDFs, filling forms, encrypting/decrypting, extracting images, and performing OCR
8 steps · start to finish.
- 1Step 1
Environment Setup
▶Install all Python dependencies and verify CLI tools are available.
echo "=== Installing Python dependencies ===" pip install pypdf pdfplumber reportlab # Install optional dependencies based on operation OPERATION="${OPERATION:-extract-text}" if [[ "$OPERATION" == "ocr" ]]; then pip install pytesseract pdf2image fi if [[ "$OPERATION" == "extract-tables" ]]; then pip install pandas openpyxl fi echo "=== Checking CLI tools ===" command -v pdftotext >/dev/null 2>&1 && echo "pdftotext: OK" || echo "pdftotext: not found (install poppler-utils)" command -v qpdf >/dev/null 2>&1 && echo "qpdf: OK" || echo "qpdf: not found" command -v pdftk >/dev/null 2>&1 && echo "pdftk: OK" || echo "pdftk: not found (optional)" echo "=== Creating output directory ===" mkdir -p /app/resultsVerify Python imports succeed before proceeding:
from pypdf import PdfReader, PdfWriter import pdfplumber from reportlab.lib.pagesizes import letter print("All core dependencies imported successfully") - 2Step 2
Validate Inputs
▶Verify that the input PDF(s) exist and are readable before running any operation.
- 3Step 3
Execute PDF Operation
▶Choose the appropriate code block for the requested operation. Run the relevant section only.
- 4Step 4
Iterate on Errors (max 3 rounds)
▶If Step 3 raised an exception or produced an empty/corrupt output file:
- 5Step 5
Validate Outputs
▶Verify that all expected output files exist and are non-empty.
- 6Step 6
Write Executive Summary
▶Write `/app/results/summary.md` with a concise record of the run.
- 7Step 7
Write Validation Report
▶Write `/app/results/validation_report.json`.
- 8Step 8
Final Checklist (MANDATORY — do not skip)
▶echo "=== FINAL OUTPUT VERIFICATION ===" RESULTS_DIR="/app/results"