Add CLAUDE.md project documentation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
4 months ago · 6b308d3a01
parent 49d11e8869
commit 6b308d3a01
2 changed files with 143 additions and 1 deletion
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -0,0 +1,142 @@
 # CLAUDE.md — Letters & Poetry Project
 ## Project Overview
 A Python project that downloads, parses, and displays historic love letters and classic poetry from Project Gutenberg. Data is pre-parsed and committed to git so end users don't need to download anything. A web UI is served via the `hicalsoft.github.io` GitHub Pages site.
 ## Repository Structure
 ```
 ├── download_letters.py          # Downloads & parses 11 letter sources from Gutenberg
 ├── download_poetry.py           # Downloads & parses 15 poetry sources from Gutenberg
 ├── love_letters.py              # CLI app: displays random letters in the terminal
 ├── letters/                     # Pre-parsed letter JSON files (11 sources, ~1,307 letters)
 ├── poetry/                      # Pre-parsed poetry JSON files (15 sources, ~3,098 poems)
 ├── hicalsoft.github.io/         # Embedded repo for GitHub Pages web UI (separate git history)
 │   ├── letters/index.html       # Standalone SPA for browsing letters
 │   ├── letters/data/letters.json
 │   ├── poetry/index.html        # Standalone SPA for browsing poetry
 │   └── poetry/data/poetry.json
 └── README.md
 ```
 ## Key Commands
 ```bash
 # Download all letter sources (requires internet)
 python3 download_letters.py
 # Download all poetry sources (requires internet)
 python3 download_poetry.py
 # Run CLI app (no internet needed — reads from letters/ directory)
 python3 love_letters.py                  # Random letter
 python3 love_letters.py --list           # List sources
 python3 love_letters.py --source napoleon  # Filter by source
 ```
 ## Data Pipeline
 1. **Download scripts** fetch raw `.txt` files from Project Gutenberg via `urllib`
 2. Each source has a **custom extractor function** that parses the Gutenberg text format
 3. Parsed data is saved as JSON to `letters/` or `poetry/` directories
 4. A separate step combines the individual JSON files into `letters.json` / `poetry.json` for the web UI (these live in `hicalsoft.github.io/*/data/`)
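The download, extract, and save steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's code: `gutenberg_url`, `fetch_text`, `extract_demo`, and `run_source` are hypothetical names, and the URL pattern is an assumption about how Project Gutenberg serves plain text.

```python
import json
import urllib.request
from pathlib import Path


def gutenberg_url(book_id: int) -> str:
    # Assumed plain-text URL pattern; the real scripts may use a different one.
    return f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"


def fetch_text(book_id: int) -> str:
    # Step 1: download the raw .txt file (requires internet).
    with urllib.request.urlopen(gutenberg_url(book_id)) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_demo(text: str) -> list:
    # Step 2 stand-in: real extractors are custom per source.
    # Here every blank-line-separated block becomes one entry.
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    return [
        {"author": "Demo", "recipient": "Demo", "body": b, "source": "demo"}
        for b in blocks
    ]


def run_source(text: str, extractor, out_path: Path) -> int:
    # Steps 2 and 3: parse the text, then save the result as JSON.
    items = extractor(text)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(items, fh, ensure_ascii=False, indent=2)
    return len(items)
```

With network access, a call such as `run_source(fetch_text(22009), extract_demo, Path("letters/demo.json"))` would exercise all three steps; keeping the fetch separate from the parse makes the extractors testable offline.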
 ### Regenerating web UI data
 ```python
 # Letters — run from project root
 import glob
 import json
 out = {"authors": {}, "letters": []}
 for path in sorted(glob.glob("letters/*.json")):
     with open(path, encoding="utf-8") as fh:
         letters = json.load(fh)
     for letter in letters:
         # Short keys keep the combined file small
         out["letters"].append({
             "a": letter["author"], "r": letter["recipient"],
             "h": letter.get("heading", ""), "b": letter["body"],
             "s": letter["source"], "p": letter.get("period", ""),
         })
         # Count letters per author
         out["authors"][letter["author"]] = out["authors"].get(letter["author"], 0) + 1
 with open("hicalsoft.github.io/letters/data/letters.json", "w", encoding="utf-8") as fh:
     json.dump(out, fh, separators=(",", ":"))
 # Poetry — same pattern with {authors, poems} structure
 ```
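Since poetry follows the same pattern, the combining step could be factored into one helper that serves both collections. The sketch below is an assumption-laden illustration: `combine`, `compact_poem`, and `write_compact` are hypothetical names, and the poem field `title` with its short key `t` is a guess at the poetry schema, which this document does not spell out.

```python
import glob
import json
from pathlib import Path


def combine(pattern, list_key, compact):
    """Merge per-source JSON files into {"authors": {...}, list_key: [...]}."""
    out = {"authors": {}, list_key: []}
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as fh:
            items = json.load(fh)
        for item in items:
            out[list_key].append(compact(item))
            # Count entries per author.
            author = item["author"]
            out["authors"][author] = out["authors"].get(author, 0) + 1
    return out


def compact_poem(poem):
    # Hypothetical schema: "title" and the short key "t" are assumptions.
    return {
        "a": poem["author"],
        "t": poem.get("title", ""),
        "b": poem["body"],
        "s": poem.get("source", ""),
    }


def write_compact(data, out_path):
    # Compact separators, as in the letters snippet, keep the file small.
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(data, fh, separators=(",", ":"))
```

Run from the project root, usage might look like `write_compact(combine("poetry/*.json", "poems", compact_poem), "hicalsoft.github.io/poetry/data/poetry.json")`, once `compact_poem` is adjusted to the real field names.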
 ## Gutenberg Parsing Notes
 - **Line endings**: Always normalize with `.replace("\r\n", "\n").replace("\r", "\n")` before regex splitting
 - **START/END markers** vary: `"*** START OF THE PROJECT GUTENBERG EBOOK"`, `"*** START OF THIS PROJECT GUTENBERG EBOOK"`, etc. — use regex
 - **Each source needs a custom extractor** due to unique formatting (Roman numerals, ALL CAPS titles, numbered entries, etc.)
 - **CONTENTS sections** often repeat the headings used in the actual content, so an extractor must either skip to the 2nd occurrence of a heading or verify the surrounding context
 - **Poe** is the most complex: uses section tracking (poem_sections vs non_poem_sections) to only extract from the 4 actual poetry sections, skipping memoir, notes, prose, essays
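The first, second, and fourth notes above (line endings, START/END markers, and duplicated CONTENTS headings) can be illustrated with a short sketch. The helper names `normalize_newlines`, `strip_gutenberg_wrapper`, and `find_second_heading` are hypothetical, and the marker regexes cover only the two variants quoted above; real files may need looser patterns.

```python
import re

# Matches "*** START OF THE/THIS PROJECT GUTENBERG EBOOK ..." and the END twin.
_START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*$", re.MULTILINE
)
_END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*$", re.MULTILINE
)


def normalize_newlines(text: str) -> str:
    # Convert CRLF and bare CR to LF before any regex splitting.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def strip_gutenberg_wrapper(text: str) -> str:
    # Keep only the text between the START and END marker lines.
    text = normalize_newlines(text)
    start = _START_RE.search(text)
    begin = start.end() if start else 0
    end = _END_RE.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


def find_second_heading(text: str, heading: str) -> int:
    """Offset of the 2nd line matching `heading` (the 1st if unique, -1 if absent)."""
    pattern = re.compile(rf"^[ \t]*{re.escape(heading)}[ \t]*$", re.MULTILINE)
    offsets = [m.start() for m in pattern.finditer(text)]
    if not offsets:
        return -1
    return offsets[1] if len(offsets) > 1 else offsets[0]
```

Taking the second occurrence is only a heuristic; when a heading also appears in an index or in running text, checking the lines that follow it is the safer route.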
 ## Letter Sources (11)
 | Source | Gutenberg ID | Extractor |
 |--------|-------------|-----------|
 | Henry VIII to Anne Boleyn | 22009 | `extract_henry_viii` |
 | Mary Wollstonecraft to Gilbert Imlay | 3529 | `extract_wollstonecraft` |
 | Abelard & Héloïse | 35977 | `extract_abelard_heloise` |
 | Napoleon to Josephine | 37499 | `extract_napoleon` |
 | Keats to Fanny Brawne | 35698 | `extract_keats_brawne` |
 | Browning Letters Vol 1 | 50400 | `_extract_browning` |
 | Browning Letters Vol 2 | 51263 | `_extract_browning` |
 | Burns to Clarinda | 6131 | `extract_burns_clarinda` |
 | Dorothy Osborne | 34387 | `extract_dorothy_osborne` |
 | Beethoven | 13065 | `extract_beethoven` |
 | Mozart | 5307 | `extract_mozart` |
 ## Poetry Sources (15)
 | Source | Gutenberg ID | Extractor |
 |--------|-------------|-----------|
 | Shakespeare Sonnets | 1041 | `extract_shakespeare_sonnets` |
 | Emily Dickinson | 12242 | `extract_dickinson` |
 | Walt Whitman | 1322 | `extract_whitman` |
 | William Blake | 1934 | `extract_blake` |
 | John Keats | 23684 | `extract_keats` |
 | Edgar Allan Poe | 10031 | `extract_poe` |
 | E.B. Browning Sonnets | 2002 | `extract_browning_sonnets` |
 | T.S. Eliot | 1321 | `extract_eliot_wasteland` |
 | Robert Frost (Mountain) | 29345 | `extract_frost_mountain` |
 | Robert Frost (Selected) | 59824 | `extract_frost_selected` |
 | W.B. Yeats | 32233 | `extract_yeats` |
 | Omar Khayyám | 246 | `extract_khayyam` |
 | Robert Burns | 1279 | `extract_burns` |
 | William Wordsworth | 9622 | `extract_wordsworth` |
 | Percy Shelley | 4800 | `extract_shelley` |
 ## Web UI Architecture
 Both `/letters` and `/poetry` pages are **standalone SPAs** (no Jekyll dependency). They:
 - Load a single combined JSON file via `fetch()`
 - Match the site's dark neumorphism theme (bg `#2b2d2f`, text `#fff`)
 - Letters uses red accent (`#ff073a`), Poetry uses purple accent (`#c084fc`)
 - Feature: author/poet sidebar, card grid, random button, detail view with font controls
 - Keyboard nav: ←/→ arrows, Escape to go back, R for random
 - Font auto-fit: calculates ideal font size from container width and longest line length
 - Manual A+/A− buttons override auto; click "auto" label to reset
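The font auto-fit arithmetic might look roughly like the sketch below. It is written in Python only to match the rest of this repository's examples (the pages themselves run in the browser), and `auto_fit_font_px`, the 0.6 em average glyph width, and the 10 to 22 px clamp are all assumptions rather than values taken from the site.

```python
def auto_fit_font_px(container_px, longest_line_chars,
                     char_width_em=0.6, min_px=10.0, max_px=22.0):
    """Largest font size (px) at which the longest line still fits the container."""
    if longest_line_chars <= 0:
        # Nothing to fit, so fall back to the largest allowed size.
        return max_px
    # A line of N characters is about N * char_width_em * font_size pixels wide.
    ideal = container_px / (longest_line_chars * char_width_em)
    # Clamp so text never becomes unreadably small or oversized.
    return max(min_px, min(max_px, ideal))
```

A manual A+/A− override would simply bypass this function until the "auto" label is clicked again.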
 ## Change Log
 ### Fix Poe parser and add font size controls
 - Rewrote `extract_poe()` with section tracking (poem_sections vs non_poem_sections)
 - Only extracts from 4 poetry sections, skips memoir/notes/prose/essays/dedications
 - Result: 51 clean poems (was 108 with junk)
 - Added `_is_title()`, `_save_current()`, `_norm()` helper functions
 - Added skip_titles set for sub-headings (PREFACE, dedications, etc.)
 - Renames "Part I/II" → "Al Aaraaf — Part I/II"
 ### Add poetry collection
 - Created `download_poetry.py` with 15 Gutenberg extractors
 - 3,098 poems from 15 sources stored in `poetry/` as JSON
 - Created `/poetry` web page matching site theme
 ### Remove letter truncation
 - Removed `truncate_letter()` — shows full letter text
 ### Restructure: pre-downloaded letters
 - Moved from download-on-run to pre-parsed JSON in `letters/`
 - Added 6 new sources (Browning Vol 1 and Vol 2, Burns, Osborne, Beethoven, Mozart)
 - 1,307 letters from 11 sources
 ### Initial love letters app
 - Created `love_letters.py` CLI with 5 initial sources
 - `download_letters.py` for fetching/parsing Gutenberg texts
--- a/hicalsoft.github.io
+++ b/hicalsoft.github.io
@@ -1 +1 @@
-Subproject commit 7fd16c08b20d81da497d2efb44af2e83860382f4
+Subproject commit c292f8eed1677afa7de015a8a32098f7b3f52956