We’ll explore how to leverage a corpus of 1,800 articles (~1.8 million words — comparable to A Song of Ice and Fire) to build an AI-driven discovery platform that makes reading and exploration smarter and more meaningful.
These articles were originally transcripts of lectures recorded since the early 2000s. For years, they existed only as 37 separate Word files (some created in Word 97), each identified only by its date. They were never published online but were gathered once a year into printed booklets exported directly from Word, with minimal formatting and uneven layout.
Using a custom Python script and Pandoc, the collection was converted into individual Markdown files, one per lecture. An AI system then analyzed each text to create meaningful titles, add intermediate headings, and generate concise SEO descriptions — transforming a chronological archive into a structured, readable, and accessible body of work.
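As a sketch of that conversion step (assuming the legacy `.doc` files were first resaved as `.docx`, since Pandoc reads `.docx` but not the old binary format, and leaving out the splitting of multi-lecture files into individual articles), the batch loop might look like this:

```python
import subprocess
from pathlib import Path

SOURCE_DIR = Path("word_exports")   # hypothetical folder of .docx exports
TARGET_DIR = Path("articles")       # one Markdown file per source document
TARGET_DIR.mkdir(exist_ok=True)

for docx in sorted(SOURCE_DIR.glob("*.docx")):
    out = TARGET_DIR / f"{docx.stem}.md"
    # Pandoc does the heavy lifting; --wrap=none keeps each paragraph on one line
    subprocess.run(
        ["pandoc", str(docx), "-f", "docx", "-t", "markdown",
         "--wrap=none", "-o", str(out)],
        check=True,
    )
    print(f"converted {docx.name} -> {out.name}")
```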
The goal is to transform a vast, heterogeneous collection of articles into a structured, AI-enabled knowledge platform.

The ultimate aim is to make the articles more discoverable, approachable, and meaningful, without losing the depth and subtlety of the originals.
Before AI can meaningfully process the corpus, the data must be clean and consistent.
This step ensures a high-quality foundation, which is critical for metadata generation, indexing, and AI interaction.
Metadata transforms a large corpus into a discoverable and analyzable knowledge base.
AI can assist here by reading each article and suggesting thematic tags or highlights while maintaining consistency across the corpus.
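The post doesn't show the model or prompt used for this; one plausible shape for the metadata call, using the OpenAI client purely as an illustration (the model name, prompt, and JSON schema are all assumptions):

```python
import json
from openai import OpenAI  # any chat-capable LLM client would do

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def suggest_metadata(article_text: str) -> dict:
    """Ask the model for a title, description, and keywords as JSON."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Return JSON with keys: title, description, keywords "
                        "(a list of up to 10 tags drawn from a fixed vocabulary)."},
            {"role": "user", "content": article_text[:8000]},  # stay within context
        ],
    )
    return json.loads(response.choices[0].message.content)
```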
A robust indexing system enables fast, accurate retrieval.
This ensures that users can explore the corpus efficiently, whether seeking a specific lesson or discovering related concepts.
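As one concrete possibility (the embedding model and FAISS are assumptions, not the platform's stated stack), a flat vector index already gives fast semantic retrieval over 1,800 articles:

```python
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model

def build_index(texts: list[str]) -> faiss.IndexFlatIP:
    """Embed every article once and index the vectors for cosine search."""
    vectors = model.encode(texts, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])   # inner product == cosine here
    index.add(np.asarray(vectors, dtype="float32"))
    return index

def search(index: faiss.IndexFlatIP, query: str, k: int = 5):
    """Return the indices and similarity scores of the k closest articles."""
    q = model.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), k)
    return list(zip(ids[0], scores[0]))
```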
The AI layer transforms static content into an intelligent, interactive discovery platform.
This layer bridges the gap between static text and dynamic understanding, making the articles more accessible to practitioners and researchers alike.
Finally, the platform should be easily accessible.
We can use a Python script to automate the evaluation and enrichment of Markdown files by assigning each one a semantic score for any chosen theme, for instance "daily life".
It works in three stages:
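The first of those stages, the raw thematic scoring itself, is only summarized here; one plausible sketch scores each article by embedding similarity to the theme phrase (the model choice is an assumption, and an LLM prompt returning a rating would serve equally well):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def raw_theme_score(text: str, theme: str = "daily life") -> float:
    """Cosine similarity between an article and a theme phrase, roughly -1..1."""
    article_vec, theme_vec = model.encode([text, theme], convert_to_tensor=True)
    return float(util.cos_sim(article_vec, theme_vec))
```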
*Figure: distribution of articles across combined scores, shown as a Gaussian curve.*

**Gaussian Normalization.** To ensure a smooth and realistic distribution of results, the raw scores are rebalanced using a Gaussian (normal) curve, spreading the final values across the full 1–10 range. This avoids clusters of identical scores and makes large-scale data more insightful.
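The post doesn't give the exact formula, but a rank-based inverse-normal transform is a minimal sketch that produces this behavior:

```python
import numpy as np
from scipy.stats import norm, rankdata

def gaussian_rescale(raw_scores, lo=1, hi=10):
    """Rebalance raw scores onto a 1-10 scale via a rank-based normal transform."""
    raw = np.asarray(raw_scores, dtype=float)
    ranks = rankdata(raw)                       # 1..n, ties get averaged ranks
    quantiles = (ranks - 0.5) / len(raw)        # strictly inside (0, 1)
    z = norm.ppf(quantiles)                     # position on the bell curve
    if z.max() == z.min():                      # degenerate case: all scores equal
        return np.full(len(raw), (lo + hi) // 2)
    z = (z - z.min()) / (z.max() - z.min())     # squash to [0, 1]
    return np.round(lo + z * (hi - lo)).astype(int)

# e.g. gaussian_rescale([0.12, 0.55, 0.31, 0.90]) -> array([1, 7, 4, 10])
```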
**AI-Generated Metadata Example**
The metadata for this article was generated from an AI-assisted analysis of hundreds of articles. By identifying recurring themes of awareness, balance, and mindful attention, the AI distilled key insights into how stillness can nurture clarity and calm in everyday life.
```yaml
---
title: The Power of Stillness – Deepening Awareness Through Mindful Sitting
sourceLanguage: en
description: Explore how mindful sitting cultivates presence, clarity, and balance in everyday life through sustained attention and inner calm.
lastUpdated: 2009-10-31T11:00:00Z
wordCount: 1586
keywords:
  - Mindfulness
  - Meditation
  - Awareness
  - Focus
  - Presence
  - Inner Calm
  - Clarity
  - Balance
  - Attention
  - Well-being
interest_score: 10
daily_life: 8
mindful_posture: 8
intentional_action: 7
continuity_of_practice: 10
---
```
**Frontmatter Update.** The script then writes the final score back into each file's YAML frontmatter as a new or updated field.
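A minimal sketch of that write-back, assuming the `---`-delimited frontmatter shown above (the file path and the PyYAML dependency are illustrative):

```python
import yaml
from pathlib import Path

def update_frontmatter(md_path: Path, field: str, value: int) -> None:
    """Rewrite one YAML frontmatter field, preserving the article body."""
    text = md_path.read_text(encoding="utf-8")
    _, header, body = text.split("---\n", 2)   # assumes '---' delimited frontmatter
    meta = yaml.safe_load(header) or {}
    meta[field] = value
    md_path.write_text(f"---\n{yaml.safe_dump(meta, sort_keys=False)}---\n{body}",
                       encoding="utf-8")

update_frontmatter(Path("articles/2009-10-31.md"), "daily_life", 8)
```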
The result: a harmonized dataset of Markdown texts, each carrying a consistent, AI-generated metric of thematic relevance — ready for visualization, filtering, or content analytics.
No one could realistically read 1.8 million words of articles in print: even a skilled reader at roughly 250 words per minute would need about 120 hours, or roughly two full weeks of reading, to get through them all. Since the full corpus of 1,800 articles is already available online, the real challenge is curation: selecting and assembling themed anthologies that present readers with beautifully printed excerpts tailored to their interests.
Yet another Python script can automate the conversion and assembly of Markdown articles into a single, beautifully formatted LaTeX book — ready to compile with XeLaTeX or LuaLaTeX on Overleaf.
It performs the entire workflow in a few elegant steps:
**Markdown Cleaning and Conversion.** Given a Markdown file, the script strips the frontmatter, cleans the text, and converts it to LaTeX: each `.md` file becomes a self-contained `.tex` chapter with a proper `\chapter{Title}` heading, as sketched below.
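Under the same frontmatter convention as earlier, the per-article conversion might look like this (Pandoc on the PATH is assumed):

```python
import subprocess
import yaml
from pathlib import Path

def md_to_chapter(md_path: Path, tex_dir: Path) -> Path:
    """Convert one article to a LaTeX chapter, promoting its title to \\chapter{}."""
    text = md_path.read_text(encoding="utf-8")
    _, header, body = text.split("---\n", 2)           # same frontmatter layout as above
    title = yaml.safe_load(header)["title"]
    tex_body = subprocess.run(
        ["pandoc", "-f", "markdown", "-t", "latex", "--wrap=none"],
        input=body, capture_output=True, text=True, check=True,
    ).stdout
    out = tex_dir / f"{md_path.stem}.tex"
    out.write_text(f"\\chapter{{{title}}}\n\n{tex_body}", encoding="utf-8")
    return out
```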
The pages below are an excerpt from the automatically generated LaTeX PDF of an AI-curated list of articles. They illustrate the output of an AI pipeline designed to analyze large collections of mindfulness writings, extract thematic patterns, and produce well-structured, print-ready documents. The layout demonstrates how semantic analysis and automated typesetting can work together to transform raw text data into readable, publication-quality material.
**Chapter Management.** Every generated chapter file is recorded in a hidden list (`.chapters_list.txt`). This ensures that chapters are automatically tracked and appear in the correct order when the book is rebuilt.
**Book Assembly.** The script then reconstructs a master file, `main.tex`, combining the shared preamble with one `\input` line per tracked chapter, as sketched below.
The result is a complete, typeset-ready LaTeX book automatically regenerated each time a new article is added.
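A sketch of that rebuild step; the `book` class and `preamble.tex` are assumptions, while `.chapters_list.txt` comes from the chapter-management step above:

```python
from pathlib import Path

CHAPTERS_LIST = Path(".chapters_list.txt")   # one chapter filename per line

def rebuild_main(preamble: str = "preamble.tex") -> None:
    """Regenerate main.tex from the tracked chapter list."""
    chapters = CHAPTERS_LIST.read_text(encoding="utf-8").splitlines()
    inputs = "\n".join(f"\\input{{{c}}}" for c in chapters if c.strip())
    Path("main.tex").write_text(
        "\\documentclass[11pt]{book}\n"
        f"\\input{{{preamble}}}\n"
        "\\begin{document}\n"
        f"{inputs}\n"
        "\\end{document}\n",
        encoding="utf-8",
    )
```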
This script bridges editorial workflow and craftsmanship: it transforms hundreds of Markdown articles into a unified anthology.
The result is a scalable, AI-powered publishing platform.
Readers are offered a way to rediscover meaningful content through both digital exploration and carefully crafted print editions—bridging technology and editorial quality.