Non-blocking preflight, or: a build that always produces a PDF

I document software for a living, and the habit that follows me home is validation at build time: the pipeline checks its own output before anyone downstream relies on it. A press-ready PDF is not software, but it earns the same treatment. Every book in this pipeline ends the same way: a LuaTeX build turns the source into a single PDF that gets handed to a commercial print shop. Between “the build finished” and “the file is on press” sits one step that decides whether the next weeks go smoothly or expensively - preflight, the prepress check that the PDF actually meets the printer’s specifications: the right page boxes, every font embedded, images in CMYK at sufficient resolution, an output intent attached, and no structural surprises.

This matters because a PDF can look perfect on screen and still be wrong on press. An image left in RGB, a font the build forgot to embed, a photo placed at 280 ppi - none of these are visible in a PDF viewer, and all of them are caught, and charged for, at the print shop. So the interesting question is not which checks to run - those simply mirror the print shop’s own list - but what the build should do when a check fails. That turns out to be a design decision with real consequences, and the choice here is deliberately unusual.

The design philosophy

Preflight can fail in two ways. The first is obvious: a check fails, the preflight tool aborts the build, no PDF is produced, and whoever triggered the build gets an error message and no artifact. The second is subtler: the check is missing or too lenient, the build silently produces a file that looks correct on screen but fails at the print shop, and nobody finds out until the file has been submitted.

Both failures are bad, but only the first is cheap to recover from: an aborted build costs a rebuild, while a defect found at the print shop is charged for.

The design philosophy here is different from both: the build always produces a PDF. If all checks pass, the file keeps its name: planning.pdf. If any check fails, the file is renamed: planning-not-print-ready.pdf. A report listing every failure is written alongside it. Nothing ever aborts mid-build due to a preflight failure.

This has three consequences:

The artifact is always available. Someone waiting for the PDF gets a PDF - with a name that makes its status unambiguous. They can look at it, verify the defect, and decide whether it’s blocking.
Failures are explicit and actionable. Each report line names the check, the page or image object (its xref number), and the measured value, which is enough to locate the specific defect. “[FAIL] resolution xref 42: 214 ppi” is actionable. “preflight failed” is not.
CI can enforce hard failures without changing the naming logic. The --strict flag converts soft failures into a non-zero exit code - useful for blocking a merge - without altering how the file is named or the report is written.

The one hard stop is a LaTeX compilation failure. If lualatex returns non-zero, there is no PDF to rename. That failure exits immediately. Everything else is soft.
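
The whole policy fits in one small function. The following is a minimal sketch rather than the pipeline's actual code: decide_output and its parameters are hypothetical names, and only the -not-print-ready.pdf suffix and the --strict behavior are taken from the description above.

```python
from pathlib import Path


def decide_output(base: Path, failures: list[dict],
                  strict: bool = False) -> tuple[Path, int]:
    """Return (final PDF path, exit code) for a build that compiled.

    `base` is the output path without extension, e.g. Path("out/planning").
    Hypothetical helper, for illustration only.
    """
    if not failures:
        # Every check passed: the file keeps its name and the build exits 0.
        return base.parent / (base.name + ".pdf"), 0
    # Any failure changes the name; only --strict changes the exit code.
    final = base.parent / (base.name + "-not-print-ready.pdf")
    return final, (1 if strict else 0)
```

The name depends only on whether failures exist, and the exit code depends on failures and strict together, so switching --strict on or off never changes which file lands on disk.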

The six checks

preflight.py runs six checks using pikepdf (for PDF structure) and PyMuPDF/fitz (for image analysis):

import pikepdf, fitz
from pathlib import Path

MM_TO_PT = 72 / 25.4

def run_preflight(pdf_path: Path) -> list[dict]:
    results = []

    def record(check, ok, detail=""):
        results.append({"check": check, "ok": ok, "detail": detail})

    with pikepdf.open(pdf_path) as pdf:
        # [1] Box geometry: BleedBox 424×301 mm, TrimBox A3 420×297 mm
        #     (a page without an explicit box falls back to its MediaBox)
        for i, page in enumerate(pdf.pages):
            bb = [float(v) for v in page.get("/BleedBox", page.mediabox)]
            tb = [float(v) for v in page.get("/TrimBox", page.mediabox)]
            w_bb, h_bb = bb[2] - bb[0], bb[3] - bb[1]
            w_tb, h_tb = tb[2] - tb[0], tb[3] - tb[1]
            ok = (abs(w_bb - 424 * MM_TO_PT) < 1 and abs(h_bb - 301 * MM_TO_PT) < 1
                  and abs(w_tb - 420 * MM_TO_PT) < 1 and abs(h_tb - 297 * MM_TO_PT) < 1)
            record("boxes", ok,
                   f"p{i+1}: BleedBox {w_bb:.1f}x{h_bb:.1f} pt  "
                   f"TrimBox {w_tb:.1f}x{h_tb:.1f} pt")

        # [2] Font embedding - follow /DescendantFonts for Type0 (CJK) fonts,
        #     whose descriptor lives on the descendant, not the top-level dict
        def font_is_embedded(font) -> bool:
            descendants = font.get("/DescendantFonts")
            if descendants is not None:
                return all(font_is_embedded(df) for df in descendants)
            fd = font.get("/FontDescriptor", {})
            # a key-membership test avoids relying on the truthiness of a
            # pikepdf stream object
            return any(k in fd for k in ("/FontFile", "/FontFile2", "/FontFile3"))

        for i, page in enumerate(pdf.pages):
            for name, font in page.get("/Resources", {}).get("/Font", {}).items():
                emb = font_is_embedded(font)
                record("fonts", emb,
                       f"p{i+1}: {name} {'embedded' if emb else 'NOT embedded'}")

        # [5] OutputIntent
        oi = pdf.Root.get("/OutputIntents")
        record("output_intent", bool(oi),
               "present" if oi else "missing - check FOGRA39 path in pdfx options")

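        # [6] PDF version 1.6, no encryption, no JavaScript
        record("version",    pdf.pdf_version == "1.6", pdf.pdf_version)
        record("encryption", not pdf.is_encrypted, "")
        # Document-level scripts are registered in the catalog's /Names
        # dictionary under /JavaScript, not directly on the catalog, so look
        # there as well. Page- and field-level actions are not inspected.
        names = pdf.Root.get("/Names", {})
        has_js = ("/JavaScript" in names
                  or any(k in pdf.Root for k in ("/JS", "/JavaScript")))
        record("javascript", not has_js, "")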

    # [3, 4] Color space and resolution via PyMuPDF
    # (the context manager closes the document when the checks are done)
    with fitz.open(str(pdf_path)) as doc:
        for page in doc:
            for img_info in page.get_images(full=True):
                xref = img_info[0]
                pix  = fitz.Pixmap(doc, xref)
                cs   = pix.colorspace.name if pix.colorspace else "unknown"
                # rely on the color-space name; pix.n == 4 also matches RGBA,
                # so it is not a safe CMYK signal
                record("colorspace", "CMYK" in cs,
                       f"xref {xref}: {cs}")
                rects = page.get_image_rects(xref)
                if rects:
                    # an image may be placed more than once on a page; the
                    # largest placement has the lowest effective resolution
                    ppi = min(
                        (min(pix.width / (r.width / 72), pix.height / (r.height / 72))
                         if r.width and r.height else 0)
                        for r in rects)
                    record("resolution", ppi >= 300, f"xref {xref}: {ppi:.0f} ppi")

    return results

The six checks mirror the items on the print shop’s own preflight list:

#	Check	Library	What it catches
1	Box geometry	pikepdf	TrimBox or BleedBox not matching the spec (A3 trim plus 2 mm bleed per side)
2	Font embedding	pikepdf	Any font that is not embedded (in full or as a subset)
3	Color space	PyMuPDF	Any image not in CMYK
4	Resolution	PyMuPDF	Any image below 300 ppi at placed size
5	OutputIntent	pikepdf	Missing OutputIntent (the FOGRA39 ICC profile)
6	PDF validity	pikepdf	Wrong PDF version, encryption, or JavaScript
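
Check 1 compares box sizes in PDF points, while the specification is written in millimetres. Here is a short worked conversion using the same 72 / 25.4 factor and 1 pt tolerance as preflight.py; the helper names mm_to_pt and matches are illustrative only.

```python
MM_TO_PT = 72 / 25.4  # 1 inch = 25.4 mm = 72 PDF points


def mm_to_pt(mm: float) -> float:
    """Convert millimetres to PDF points."""
    return mm * MM_TO_PT


def matches(measured_pt: float, spec_mm: float, tol_pt: float = 1.0) -> bool:
    """True when a measured box dimension is within tol_pt of the spec."""
    return abs(measured_pt - mm_to_pt(spec_mm)) < tol_pt


# A3 trim size, and the same page with 2 mm of bleed on every side.
trim = (round(mm_to_pt(420), 2), round(mm_to_pt(297), 2))    # (1190.55, 841.89)
bleed = (round(mm_to_pt(424), 2), round(mm_to_pt(301), 2))   # (1201.89, 853.23)
```

The 1 pt tolerance is about 0.35 mm, which absorbs rounding in the box values while still separating the trim size from the bleed size, since the two differ by more than 11 pt in each direction.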

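Every check calls record() whether it passes or fails, and run_preflight() returns the full list either way: a failed check is a row in the results, never an exception. The only thing that can still stop the function is an unexpected error, such as a file that pikepdf cannot open.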
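
One way to keep the build non-blocking even when a check crashes rather than fails, for example on a file the PDF library cannot parse, is to turn exceptions into failed results. This is a sketch and not part of preflight.py: safe_preflight is a hypothetical wrapper, and the "internal" check name is an assumption.

```python
from pathlib import Path
from typing import Callable


def safe_preflight(pdf_path: Path,
                   runner: Callable[[Path], list[dict]]) -> list[dict]:
    """Call runner(pdf_path); report any exception as a failed check."""
    try:
        return runner(pdf_path)
    except Exception as exc:  # deliberately broad: nothing may abort the build
        return [{"check": "internal", "ok": False,
                 "detail": f"{type(exc).__name__}: {exc}"}]
```

The orchestrator would then call safe_preflight(out_pdf, run_preflight) instead of calling run_preflight() directly. The trade-off in this simple form is that a crash discards the results of the checks that had already run; wrapping each check individually would keep them.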

Two checks that need care

Two of these checks have a non-obvious failure mode, and the code above is written to avoid both. They are worth spelling out, because the naive version of each is what most people write first: it passes on simple documents and breaks on this one, a book with Chinese calligrams and multilingual glossary terms.

The font check can’t just read /FontDescriptor from the top-level font dictionary. That works for simple fonts, but composite (Type0) fonts - the kind LuaTeX produces for CJK and other large character sets - carry the descriptor on a descendant font instead. Read only the top level and font.get("/FontDescriptor", {}) comes back empty, so a perfectly embedded CJK font is reported as “NOT embedded”: a false failure on exactly the glyphs this book depends on. The font_is_embedded helper above handles it by recursing through /DescendantFonts before concluding anything.
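
The difference shows up even without a PDF at hand. The sketch below uses plain Python dicts as stand-ins for pikepdf font dictionaries (an assumption made so the example runs anywhere; both support .get() and key membership) and compares a hypothetical top-level-only check, naive_is_embedded, with the recursive one.

```python
FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def naive_is_embedded(font) -> bool:
    """Top-level only: misses descriptors that live on descendant fonts."""
    fd = font.get("/FontDescriptor", {})
    return any(k in fd for k in FONT_FILE_KEYS)


def font_is_embedded(font) -> bool:
    """Recurse through /DescendantFonts before reading the descriptor."""
    descendants = font.get("/DescendantFonts")
    if descendants is not None:
        return all(font_is_embedded(df) for df in descendants)
    fd = font.get("/FontDescriptor", {})
    return any(k in fd for k in FONT_FILE_KEYS)


# Stand-in for an embedded composite font: the Type0 dictionary has no
# descriptor of its own; its CIDFont descendant carries it.
cjk_font = {
    "/Subtype": "/Type0",
    "/DescendantFonts": [
        {"/Subtype": "/CIDFontType0",
         "/FontDescriptor": {"/FontFile3": b"<font program>"}},
    ],
}
# Stand-in for an embedded simple font: descriptor at the top level.
simple_font = {"/Subtype": "/TrueType",
               "/FontDescriptor": {"/FontFile2": b"<font program>"}}
```

Both functions agree on simple_font, but naive_is_embedded(cjk_font) returns False while font_is_embedded(cjk_font) returns True, which is exactly the false failure described above.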

The color-space check must key off the color-space name, not the channel count. It’s tempting to treat pix.n == 4 as “four channels, therefore CMYK,” but n includes alpha - an RGBA image also has n == 4 and would pass a check it should fail. Testing "CMYK" in cs is the reliable signal; the channel-count shortcut is dropped for that reason.
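
The arithmetic behind this is simple: a pixmap's channel count is its number of color components plus one if it carries alpha. In the sketch below, PixInfo is a hypothetical record standing in for the two pixmap attributes the check reads, so the example runs without PyMuPDF.

```python
from dataclasses import dataclass


@dataclass
class PixInfo:
    """Hypothetical stand-in for the pixmap attributes the check reads."""
    cs_name: str  # color-space name, e.g. "DeviceRGB"
    n: int        # channels per pixel, alpha included


def naive_is_cmyk(pix: PixInfo) -> bool:
    return pix.n == 4              # wrong: RGB plus alpha also has 4 channels


def is_cmyk(pix: PixInfo) -> bool:
    return "CMYK" in pix.cs_name   # keys off the color-space name


cmyk = PixInfo("DeviceCMYK", 4)  # C, M, Y, K
rgba = PixInfo("DeviceRGB", 4)   # R, G, B + alpha
rgb = PixInfo("DeviceRGB", 3)    # R, G, B
```

The two functions agree on cmyk and rgb; they differ only on rgba, where the channel-count shortcut lets an RGB image through.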

Neither is exotic, and that’s the point: a false “font not embedded” failure on every build trains people to ignore the report, which is the one thing a preflight cannot afford. The cost of getting these right is a few lines; the cost of getting them wrong is a check nobody trusts.

The orchestrator

build_print.py collects image-preparation defects from prepare_images.py and preflight defects from run_preflight(), then decides the output name:

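# Excerpt from inside the build function: out_pdf, base, img_issues and
# args are set by the earlier steps.
# 3. Preflight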
checks     = run_preflight(out_pdf)
all_issues = img_issues + [c for c in checks if not c["ok"]]

for c in checks:
    mark = "OK" if c["ok"] else "FAIL"
    print(f"  [{mark}]  {c['check']:<16}  {c['detail']}")

# 4. Name the output
if all_issues:
    final  = base.parent / (base.name + "-not-print-ready.pdf")
    report = base.parent / (base.name + "-preflight.txt")
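    # replace() overwrites a leftover file from an earlier failed build;
    # rename() raises on Windows if the target already exists
    out_pdf.replace(final)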
    report.write_text("\n".join(
        f"[FAIL] {i.get('check','')}  {i.get('detail','')}"
        for i in all_issues
    ))
    print(f"\n  [FAIL]  PDF not print-ready -> {final.name}")
    print(f"          Report             -> {report.name}")
    return 1 if args.strict else 0

print(f"\n  [OK]  All checks passed -> {out_pdf.name}")
return 0

The naming logic and the CI exit code are independent. --strict adds a non-zero return when there are failures, which is what CI needs to block a merge. Without --strict, the build exits 0 even with failures - the file is renamed and the report is written, but the pipeline continues.

Why this matters in practice

Consider two failure scenarios:

Scenario A: an image that went through the wrong conversion path and is still RGB. The preflight catches it. The PDF is renamed planning-not-print-ready.pdf. The report says [FAIL] colorspace xref 12: DeviceRGB. The designer opens the report, runs prepare_images.py again on the correct source, rebuilds. The file is corrected before it leaves the build environment.

Scenario B: an image that passes every other check but sits at 280 ppi. The preflight catches it. The report says [FAIL] resolution xref 8: 280 ppi. Either the image needs to be re-sourced at a higher resolution or its placed size needs to shrink, because resolution is checked at placed size, not at the image’s native size. The designer has the information to make the right call.

In both cases, the designer has a PDF to look at while reading the report. They can correlate the failing check with the visible output. The diagnosis is faster than if the build had aborted with no artifact.
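
The trade-off in Scenario B can be worked through numerically. The pixel dimensions and placed size below are assumed for illustration, chosen so that the image lands at 280 ppi; placed_ppi uses the same formula as the resolution check, and max_scale is a hypothetical helper.

```python
def placed_ppi(px_w: int, px_h: int,
               placed_w_pt: float, placed_h_pt: float) -> float:
    """Effective resolution of an image at its placed size (72 pt = 1 inch)."""
    return min(px_w / (placed_w_pt / 72), px_h / (placed_h_pt / 72))


def max_scale(ppi: float, target: float = 300.0) -> float:
    """Factor to multiply the placed size by so the image reaches target ppi."""
    return ppi / target


# Assumed example: a 2800 x 2100 px photo placed at 10 x 7.5 in (720 x 540 pt).
ppi = placed_ppi(2800, 2100, 720, 540)   # 280.0
scale = max_scale(ppi)                   # about 0.933
```

Under these assumed numbers, the photo passes if it is placed at about 93% of its current size (roughly 9.33 in wide instead of 10 in), or if it is replaced by a source of at least 3000 x 2250 px at the original placed size.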

The extension point

run_preflight() returns a plain list of dicts. Plugging in a more rigorous validator - veraPDF, for instance - means appending its results to the same list before returning. The naming and reporting logic in build_print.py is unchanged. The check list grows; the contract stays the same.

This is the same separation-of-concerns principle that structures the rest of the pipeline: each function does one thing, returns a consistent structure, and doesn’t reach into the orchestrator’s decisions.
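
The extension contract can be sketched as a list of validators that all return the same row shape. None of the names below come from the pipeline: run_all is hypothetical, and fake_external_validator is a stub standing in for a wrapper around an external tool such as veraPDF, whose real invocation is not shown here.

```python
from pathlib import Path
from typing import Callable

Check = dict  # {"check": str, "ok": bool, "detail": str}
Validator = Callable[[Path], list[Check]]


def run_all(pdf_path: Path, validators: list[Validator]) -> list[Check]:
    """Concatenate the results of every validator into one flat list."""
    results: list[Check] = []
    for validate in validators:
        results.extend(validate(pdf_path))
    return results


def fake_external_validator(pdf_path: Path) -> list[Check]:
    """Stub for illustration: a real wrapper would run the tool and map
    each of its findings to one row in the shared format."""
    return [{"check": "external", "ok": False,
             "detail": f"{pdf_path.name}: stub finding"}]
```

Because every validator returns rows with the same three keys, the orchestrator's filter on c["ok"] and its report loop work on the combined list without modification.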

Summing up

The build always produces a PDF. Pass and the file keeps its name; fail and it is renamed planning-not-print-ready.pdf with a report alongside - a preflight failure never aborts the build.
There is exactly one hard stop. A lualatex compile failure leaves nothing to rename and exits immediately; every other check is soft.
Failures are explicit and actionable. “[FAIL] resolution xref 42: 214 ppi” names the check, the object, and the measured value; “preflight failed” does not.
Six checks mirror the print shop - box geometry, font embedding, color space, resolution at placed size, OutputIntent, and PDF validity - via pikepdf and PyMuPDF.
Two checks need care. Composite (Type0) fonts carry their descriptor on a descendant, and channel count includes alpha - get either wrong and the report cries wolf, which trains people to ignore it.
Naming and exit code are independent. --strict turns soft failures into a non-zero exit for CI without changing how the file is named or the report written.
The contract is an extension point. run_preflight() returns a plain list of dicts; a stricter validator like veraPDF appends to the same list and the orchestrator is unchanged.

External sources

Hero image: “Printing machines” by Neon Tommy, licensed under CC BY-SA 2.0.