anthropics / pdf · ★ Featured · worked examples

PDF Processing Guide

Process PDF files using Python libraries and command-line tools to perform operations such as reading, extracting text and tables, merging, splitting, rotating pages, adding watermarks, creating new PDFs, filling forms, encrypting/decrypting, extracting…

agent: claude-code · model: anthropic/claude-sonnet-4.6 · snapshot: python312-uv · eval: programmatic · 8 steps · v1.2.1

Deploy PDF Processing Guide to your jetty.io

One-click installs this runbook into a collection on your Jetty account. You can run it from the Spot dashboard, schedule it, or pipe inputs in via the API.

Run time: 1–2 mins

Headline output: extracted_tables.json · output.pdf · extracted_text.txt

Runs on Jetty's managed sandbox. No setup. Free for your first 10 runs.

Worked examples · 5

Real runs, real outputs.

extract-tables · extract · eval ✓

Research-paper tables → JSON

Extract every table from a 20-page PLOS Medicine article into one strictly-valid JSON file (empty cells null, word spacing recovered).

extracted_tables.json · extracted_tables.xlsx · extracted_text.txt · summary.md · validation_report.json

claude-sonnet-4-6 · 2 min

watermark · modify · eval ✓

Stamp a PDF CONFIDENTIAL

Overlay a bold diagonal CONFIDENTIAL watermark on all 20 pages and emit a single watermarked PDF — and only that PDF.

output.pdf · summary.md · validation_report.json

claude-sonnet-4-6 · 52s

create · author · eval ✓

Author a one-page summary PDF

Generate a clean, branded one-page Document Summary PDF from scratch — source title, page count and table count — proving the toolkit authors PDFs, not just consumes…

output.pdf · summary.md · validation_report.json

claude-sonnet-4-6 · 86s

extract-tables · extract · eval ✓

Invoice line items → JSON

Extract a commercial invoice's line-items table (item, qty, unit price, amount) into strictly-valid JSON — the canonical document-to-data use case.

extracted_tables.json · extracted_tables.xlsx · summary.md · validation_report.json

claude-sonnet-4-6 · 2 min

extract-text · extract · eval ✓

Federal RFP → full text

Extract all text from a 66-page federal solicitation (DISA RFP HC1047-05-R-4009) — ~184k characters for search/RAG. A robustness test on a large, form-heavy government…

extracted_text.txt · summary.md · validation_report.json

claude-sonnet-4-6 · 77s

The shape of the run

8 steps · start to finish.

Step 1

Environment Setup

Install all Python dependencies and verify CLI tools are available.

echo "=== Installing Python dependencies ==="
pip install pypdf pdfplumber reportlab

# Install optional dependencies based on operation
OPERATION="${OPERATION:-extract-text}"
if [[ "$OPERATION" == "ocr" ]]; then
  pip install pytesseract pdf2image
fi
if [[ "$OPERATION" == "extract-tables" ]]; then
  pip install pandas openpyxl
fi

echo "=== Checking CLI tools ==="
command -v pdftotext >/dev/null 2>&1 && echo "pdftotext: OK" || echo "pdftotext: not found (install poppler-utils)"
command -v qpdf      >/dev/null 2>&1 && echo "qpdf: OK"      || echo "qpdf: not found"
command -v pdftk     >/dev/null 2>&1 && echo "pdftk: OK"     || echo "pdftk: not found (optional)"

echo "=== Creating output directory ==="
mkdir -p /app/results

Verify Python imports succeed before proceeding:

from pypdf import PdfReader, PdfWriter
import pdfplumber
from reportlab.lib.pagesizes import letter
print("All core dependencies imported successfully")
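
The CLI check can also be done from Python, which is convenient if later steps need the result programmatically. The following is a minimal sketch using only the standard library; the function name `check_cli_tools` and the `CLI_TOOLS` mapping are illustrative and not part of the runbook.

```python
import shutil

# Tool names and hints mirror the shell check above; the mapping itself
# is an illustrative structure, not something the runbook defines.
CLI_TOOLS = {
    "pdftotext": "install poppler-utils",
    "qpdf": "",
    "pdftk": "optional",
}


def check_cli_tools(tools=CLI_TOOLS):
    """Return {tool_name: bool} saying whether each tool is on PATH."""
    return {name: shutil.which(name) is not None for name in tools}


if __name__ == "__main__":
    for name, found in check_cli_tools().items():
        hint = CLI_TOOLS[name]
        missing = "not found" + (f" ({hint})" if hint else "")
        print(f"{name}: {'OK' if found else missing}")
```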

Step 2

Validate Inputs

Verify that the input PDF(s) exist and are readable before running any operation.
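
One way to approach this check is sketched below with the standard library only. It assumes that "readable" means the file exists, is non-empty, and carries a `%PDF-` header near the start; the helper name `validate_pdf_input` is hypothetical. A fuller check could additionally open the file with `pypdf.PdfReader`.

```python
from pathlib import Path


def validate_pdf_input(path):
    """Return a list of problems; an empty list means the file looks usable."""
    p = Path(path)
    if not p.is_file():
        return [f"{p}: not found or not a regular file"]
    if p.stat().st_size == 0:
        return [f"{p}: file is empty"]
    with p.open("rb") as fh:
        head = fh.read(1024)
    # A PDF starts with a "%PDF-" header. Some producers prepend a few
    # bytes, so this sketch searches the first 1 KiB instead of byte 0.
    if b"%PDF-" not in head:
        return [f"{p}: no %PDF- header in the first 1024 bytes"]
    return []
```

Calling it on every file under `/app/assets/` and stopping if any list is non-empty keeps later steps from failing on a bad upload.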

Step 3

Execute PDF Operation

Choose the appropriate code block for the requested operation. Run the relevant section only.
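
Running "the relevant section only" can be modelled as a lookup from operation name to a single handler. The sketch below is an assumption about how that selection might be structured; `run_operation`, `HANDLERS`, and both handler functions are hypothetical placeholders, and real handlers would call pypdf, pdfplumber, or reportlab.

```python
def run_operation(operation, handlers, **kwargs):
    """Run exactly one handler, looked up by operation name."""
    try:
        handler = handlers[operation]
    except KeyError:
        known = ", ".join(sorted(handlers))
        raise ValueError(
            f"unknown operation {operation!r}; expected one of: {known}"
        ) from None
    return handler(**kwargs)


# Hypothetical handlers, for illustration only.
def rotate(degrees=90):
    return f"rotate by {degrees}"


def extract_text():
    return "extract text"


HANDLERS = {"rotate": rotate, "extract-text": extract_text}
```

Because only the selected handler is ever called, no other operation can emit stray deliverables.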

Step 4

Iterate on Errors (max 3 rounds)

If Step 3 raised an exception or produced an empty/corrupt output file:
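
A bounded retry of this kind might look like the following sketch. It only captures the "max 3 rounds" limit and the two failure triggers named above; in an actual run the agent would presumably change its approach between rounds, which this sketch does not model. The names `run_with_retries`, `operation`, and `output_ok` are hypothetical.

```python
def run_with_retries(operation, output_ok, max_rounds=3):
    """Call operation() up to max_rounds times until output_ok() passes.

    Returns (succeeded, errors), where errors has one entry per failed round.
    """
    errors = []
    for round_no in range(1, max_rounds + 1):
        try:
            operation()
        except Exception as exc:  # broad on purpose: any failure costs a round
            errors.append(f"round {round_no}: {type(exc).__name__}: {exc}")
            continue
        if output_ok():
            return True, errors
        errors.append(f"round {round_no}: output missing, empty, or corrupt")
    return False, errors
```

Returning the collected error messages makes them available for the summary and validation report written in later steps.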

Step 5

Validate Outputs

Verify that all expected output files exist and are non-empty. For JSON outputs, also verify they parse as strictly-valid JSON.
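
The checks described above can be sketched with the standard library. Note that Python's `json.loads` accepts `NaN` and `Infinity` by default, which strict JSON does not allow, so the sketch rejects them through `parse_constant`. The function names are hypothetical.

```python
import json
from pathlib import Path


def _reject_constant(name):
    # Called by json.loads for NaN, Infinity, and -Infinity.
    raise ValueError(f"non-standard JSON constant: {name}")


def validate_outputs(paths):
    """Map each path to None if it passes, else to a short failure reason."""
    results = {}
    for raw in paths:
        p = Path(raw)
        key = str(p)
        if not p.is_file():
            results[key] = "missing"
        elif p.stat().st_size == 0:
            results[key] = "empty"
        elif p.suffix.lower() == ".json":
            try:
                json.loads(p.read_text(encoding="utf-8"),
                           parse_constant=_reject_constant)
                results[key] = None
            except ValueError as exc:  # JSONDecodeError subclasses ValueError
                results[key] = f"invalid JSON: {exc}"
        else:
            results[key] = None
    return results
```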

Step 6

Write Executive Summary

Write /app/results/summary.md with a concise record of the run.

Step 7

Write Validation Report

Write /app/results/validation_report.json.
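
The report's schema is not given here, so the following sketch assumes a hypothetical shape: an overall `passed` flag plus a list of named checks. Only the output filename comes from the step itself; the function name and field names are illustrative.

```python
import json
from pathlib import Path


def write_validation_report(results_dir, checks):
    """Write validation_report.json; `checks` maps a check name to True/False."""
    report = {
        # Hypothetical fields; the real report may use a different schema.
        "passed": all(checks.values()),
        "checks": [{"name": name, "passed": bool(ok)}
                   for name, ok in checks.items()],
    }
    out = Path(results_dir) / "validation_report.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    # allow_nan=False keeps the report itself strictly valid JSON.
    out.write_text(json.dumps(report, indent=2, allow_nan=False) + "\n",
                   encoding="utf-8")
    return out
```

For example, `write_validation_report("/app/results", {"output_exists": True, "json_parses": True})` would record two passing checks.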

Step 8

Final Checklist (MANDATORY — do not skip)

echo "=== FINAL OUTPUT VERIFICATION ==="
RESULTS_DIR="/app/results"

Inputs

Input PDF(s) · file · required

One or more PDF files. Uploaded files land in /app/assets/.

Operation · select · required

merge · split · extract-text · extract-tables · rotate · watermark · create · encrypt

Pick an operation, or describe it in natural language (e.g. "extract the tables into JSON", "stamp every page CONFIDENTIAL"). Honored as of runbook v1.2.0 — the runbook runs ONLY the chosen operation and emits only its deliverables.

Output filename · text

default: output.pdf

Name for the primary output PDF (structural / create operations).

Page range · text

Page range for split/extract operations.

Rotation degrees · number

default: 90

Degrees to rotate (rotate operation).

Password · password

Password for encrypt/decrypt operations.
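
Two of the inputs above, Page range and Rotation degrees, need interpreting before they reach a PDF library. The sketch below makes two assumptions that are not stated on this page: that a page range is written as one-based pages and ranges separated by commas (for example `1-3,5`), and that rotation must be a multiple of 90, as the PDF page rotation attribute requires. Both function names are hypothetical.

```python
def parse_page_range(spec, page_count):
    """Turn a spec such as "1-3,5" into sorted zero-based page indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"page range {part!r} is outside 1..{page_count}")
        # Convert one-based inclusive pages to zero-based indices.
        indices.update(range(start - 1, end))
    return sorted(indices)


def normalize_rotation(degrees=90):
    """Validate a rotation and reduce it to 0, 90, 180, or 270."""
    if degrees % 90 != 0:
        raise ValueError(f"rotation must be a multiple of 90, got {degrees}")
    return int(degrees) % 360


print(parse_page_range("1-3,5", page_count=10))  # [0, 1, 2, 4]
print(normalize_rotation(-90))                   # 270
```

Zero-based indices are returned because pypdf's `reader.pages` is indexed from zero.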

Dependencies

pypdf · required · Python package
pdfplumber · required · Python package
reportlab · required · Python package
pytesseract · optional · Python package
pdf2image · optional · Python package
pandas · optional · Python package
openpyxl · optional · Python package
poppler-utils · optional · System package
qpdf · optional · System package
pdftk · optional · System package
tesseract-ocr · optional · System package

Origin

source: skills.sh
title: PDF Processing Guide
attr: high
