# CLAUDE.md — Letters & Poetry Project
## Project Overview
A Python project that downloads, parses, and displays historic love letters and classic poetry from Project Gutenberg. Data is pre-parsed and committed to git so end users don't need to download anything. A web UI is served via the `hicalsoft.github.io` GitHub Pages site.
## Repository Structure
```
├── download_letters.py # Downloads & parses 11 letter sources from Gutenberg
├── download_poetry.py # Downloads & parses 15 poetry sources from Gutenberg
├── love_letters.py # CLI app: displays random letters in the terminal
├── letters/ # Pre-parsed letter JSON files (11 sources, ~1,307 letters)
├── poetry/ # Pre-parsed poetry JSON files (15 sources, ~3,098 poems)
├── hicalsoft.github.io/ # Embedded repo for GitHub Pages web UI (separate git history)
│ ├── letters/index.html # Standalone SPA for browsing letters
│ ├── letters/data/letters.json
│ ├── poetry/index.html # Standalone SPA for browsing poetry
│ └── poetry/data/poetry.json
└── README.md
```
## Key Commands
```bash
# Download all letter sources (requires internet)
python3 download_letters.py
# Download all poetry sources (requires internet)
python3 download_poetry.py
# Run CLI app (no internet needed — reads from letters/ directory)
python3 love_letters.py # Random letter
python3 love_letters.py --list # List sources
python3 love_letters.py --source napoleon # Filter by source
```
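As a rough illustration of how a CLI with these flags could be wired up, here is a minimal sketch. It is not the actual `love_letters.py`: the function names, the `directory` parameter, and the substring matching rule for `--source` are assumptions. The only facts taken from this document are that `letters/` holds JSON lists of letters with `author`, `recipient`, `body`, and `source` fields.

```python
import argparse
import glob
import json
import os
import random


def load_letters(directory="letters"):
    """Read every *.json file in `directory` into one flat list of letter dicts."""
    letters = []
    for path in sorted(glob.glob(os.path.join(directory, "*.json"))):
        with open(path, encoding="utf-8") as fh:
            letters.extend(json.load(fh))
    return letters


def filter_by_source(letters, needle):
    """Assumed rule: case-insensitive substring match on the `source` field."""
    needle = needle.lower()
    return [item for item in letters if needle in item["source"].lower()]


def main(argv=None, directory="letters"):
    parser = argparse.ArgumentParser(description="Show a random letter")
    parser.add_argument("--list", action="store_true", help="list sources")
    parser.add_argument("--source", help="only pick from matching sources")
    args = parser.parse_args(argv)

    letters = load_letters(directory)
    if args.list:
        # Print each distinct source name once
        for name in sorted({item["source"] for item in letters}):
            print(name)
        return 0
    if args.source:
        letters = filter_by_source(letters, args.source)
    if not letters:
        print("No letters found.")
        return 1
    letter = random.choice(letters)
    print(f'{letter["author"]} to {letter["recipient"]}\n')
    print(letter["body"])
    return 0
```

Calling `main(["--source", "napoleon"])` would mirror the last command above; returning an exit code instead of calling `sys.exit` keeps the function easy to test.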
## Data Pipeline
1. **Download scripts** fetch raw `.txt` files from Project Gutenberg via `urllib`
2. Each source has a **custom extractor function** that parses the Gutenberg text format
3. Parsed data is saved as JSON to `letters/` or `poetry/` directories
4. A separate step combines the individual JSON files into `letters.json` / `poetry.json` for the web UI (these live in `hicalsoft.github.io/*/data/`)
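Steps 1 to 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `download_letters.py`: the URL pattern, `fetch_text`, `extract_example`, and `save_json` are hypothetical names, and the toy extractor simply splits on lines starting with `LETTER `, whereas each real source needs its own rules.

```python
import json
import os
import urllib.request

# Assumed URL pattern for plain-text downloads; the real scripts may differ.
GUTENBERG_URL = "https://www.gutenberg.org/cache/epub/{id}/pg{id}.txt"


def fetch_text(gutenberg_id, timeout=30):
    """Step 1: download the raw .txt for one Gutenberg ID (needs internet)."""
    url = GUTENBERG_URL.format(id=gutenberg_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_example(text):
    """Step 2: hypothetical extractor that splits on lines starting 'LETTER '."""
    letters, heading, body = [], None, []
    for line in text.replace("\r\n", "\n").replace("\r", "\n").split("\n"):
        if line.startswith("LETTER "):
            # A new heading closes the previous letter, if any
            if heading is not None:
                letters.append({"heading": heading, "body": "\n".join(body).strip()})
            heading, body = line.strip(), []
        elif heading is not None:
            body.append(line)
    if heading is not None:
        letters.append({"heading": heading, "body": "\n".join(body).strip()})
    return letters


def save_json(records, directory, name):
    """Step 3: write the parsed records to <directory>/<name>.json."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name + ".json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
    return path
```

Keeping the network call in its own function means the extractor and the save step can be exercised offline with a small text fixture.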
### Regenerating web UI data
```python
# Letters: run from the project root
import glob
import json

out = {"authors": {}, "letters": []}
for path in sorted(glob.glob("letters/*.json")):
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    for letter in data:
        # Short keys keep the combined file small
        out["letters"].append({
            "a": letter["author"],
            "r": letter["recipient"],
            "h": letter.get("heading", ""),
            "b": letter["body"],
            "s": letter["source"],
            "p": letter.get("period", ""),
        })
        # Tally letters per author
        out["authors"][letter["author"]] = out["authors"].get(letter["author"], 0) + 1

with open("hicalsoft.github.io/letters/data/letters.json", "w", encoding="utf-8") as fh:
    json.dump(out, fh, separators=(",", ":"))
# Poetry: same pattern with an {authors, poems} structure
```
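For the poetry side, a sketch of the same pattern might look like the following. The `{authors, poems}` shape comes from the note above; the per-poem field names (`author`, `title`, `body`, `source`), the short keys, and the `combine_poetry` function itself are assumptions and should be adjusted to whatever `download_poetry.py` actually writes.

```python
import glob
import json
import os


def combine_poetry(src_dir="poetry",
                   dest="hicalsoft.github.io/poetry/data/poetry.json"):
    """Merge per-source poem files into one compact {authors, poems} document."""
    out = {"authors": {}, "poems": []}
    for path in sorted(glob.glob(os.path.join(src_dir, "*.json"))):
        with open(path, encoding="utf-8") as fh:
            poems = json.load(fh)
        for poem in poems:
            # Field names below are assumptions, not taken from the project
            out["poems"].append({
                "a": poem["author"],
                "t": poem.get("title", ""),
                "b": poem["body"],
                "s": poem.get("source", ""),
            })
            # Tally poems per author
            out["authors"][poem["author"]] = out["authors"].get(poem["author"], 0) + 1
    parent = os.path.dirname(dest)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(dest, "w", encoding="utf-8") as fh:
        json.dump(out, fh, separators=(",", ":"))
    return out
```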
## Gutenberg Parsing Notes
- **Line endings**: Always normalize with `.replace("\r\n", "\n").replace("\r", "\n")` before regex splitting
- **START/END markers** vary: `"*** START OF THE PROJECT GUTENBERG EBOOK"`, `"*** START OF THIS PROJECT GUTENBERG EBOOK"`, etc. — use regex
- **Each source needs a custom extractor** due to unique formatting (Roman numerals, ALL CAPS titles, numbered entries, etc.)
- **CONTENTS sections** often repeat the headings used in the actual content, so an extractor must find the second occurrence of a heading or verify its surrounding context
- **Poe** is the most complex: uses section tracking (poem_sections vs non_poem_sections) to only extract from the 4 actual poetry sections, skipping memoir, notes, prose, essays
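The notes above can be sketched as a few small helpers. These are illustrative only: the regexes, `strip_gutenberg_wrapper`, `find_body_heading`, and `extract_from_sections` are hypothetical names and simplified logic, not the project's extractors, and the real `extract_poe()` does more (title detection, skip lists, renaming).

```python
import re

# Cover both the "START OF THE ..." and "START OF THIS ..." marker variants.
START_RE = re.compile(r"^\*\*\* ?START OF TH(?:E|IS) PROJECT GUTENBERG EBOOK.*$", re.M)
END_RE = re.compile(r"^\*\*\* ?END OF TH(?:E|IS) PROJECT GUTENBERG EBOOK.*$", re.M)


def normalize_newlines(text):
    """Convert CRLF and bare CR to LF so line-based regexes behave."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def strip_gutenberg_wrapper(text):
    """Return only the text between the START and END marker lines."""
    text = normalize_newlines(text)
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    stop = end.start() if end else len(text)
    return text[begin:stop].strip()


def find_body_heading(text, heading):
    """Offset of the 2nd line equal to `heading`, skipping the CONTENTS copy.

    Falls back to the first occurrence, or -1 when the heading is absent.
    """
    pattern = re.compile(r"^[ \t]*" + re.escape(heading) + r"[ \t]*$", re.M)
    hits = [m.start() for m in pattern.finditer(text)]
    if not hits:
        return -1
    return hits[1] if len(hits) > 1 else hits[0]


def extract_from_sections(lines, poem_sections, non_poem_sections):
    """Section tracking: keep lines only while inside a wanted section."""
    kept, active = [], False
    for line in lines:
        title = line.strip()
        if title in poem_sections:
            active = True
        elif title in non_poem_sections:
            active = False
        elif active:
            kept.append(line)
    return kept
```

Taking the second match is only a heuristic; a heading that appears once, or more than twice, still needs the context check mentioned above.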
## Letter Sources (11)
| Source | Gutenberg ID | Extractor |
|--------|-------------|-----------|
| Henry VIII to Anne Boleyn | 22009 | `extract_henry_viii` |
| Mary Wollstonecraft to Gilbert Imlay | 3529 | `extract_wollstonecraft` |
| Abelard & Héloïse | 35977 | `extract_abelard_heloise` |
| Napoleon to Josephine | 37499 | `extract_napoleon` |
| Keats to Fanny Brawne | 35698 | `extract_keats_brawne` |
| Browning Letters Vol 1 | 50400 | `_extract_browning` |
| Browning Letters Vol 2 | 51263 | `_extract_browning` |
| Burns to Clarinda | 6131 | `extract_burns_clarinda` |
| Dorothy Osborne | 34387 | `extract_dorothy_osborne` |
| Beethoven | 13065 | `extract_beethoven` |
| Mozart | 5307 | `extract_mozart` |
## Poetry Sources (15)
| Source | Gutenberg ID | Extractor |
|--------|-------------|-----------|
| Shakespeare Sonnets | 1041 | `extract_shakespeare_sonnets` |
| Emily Dickinson | 12242 | `extract_dickinson` |
| Walt Whitman | 1322 | `extract_whitman` |
| William Blake | 1934 | `extract_blake` |
| John Keats | 23684 | `extract_keats` |
| Edgar Allan Poe | 10031 | `extract_poe` |
| E.B. Browning Sonnets | 2002 | `extract_browning_sonnets` |
| T.S. Eliot | 1321 | `extract_eliot_wasteland` |
| Robert Frost (Mountain) | 29345 | `extract_frost_mountain` |
| Robert Frost (Selected) | 59824 | `extract_frost_selected` |
| W.B. Yeats | 32233 | `extract_yeats` |
| Omar Khayyám | 246 | `extract_khayyam` |
| Robert Burns | 1279 | `extract_burns` |
| William Wordsworth | 9622 | `extract_wordsworth` |
| Percy Shelley | 4800 | `extract_shelley` |
## Web UI Architecture
Both `/letters` and `/poetry` pages are **standalone SPAs** (no Jekyll dependency). They:
- Load a single combined JSON file via `fetch()`
- Match the site's dark neumorphism theme (bg `#2b2d2f`, text `#fff`)
- Letters uses red accent (`#ff073a`), Poetry uses purple accent (`#c084fc`)
- Feature: author/poet sidebar, card grid, random button, detail view with font controls
- Keyboard nav: ←/→ arrows, Escape to go back, R for random
- Font auto-fit: calculates ideal font size from container width and longest line length
- Manual A+/A buttons override auto; click "auto" label to reset
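The font auto-fit arithmetic can be illustrated as below. The real pages run in the browser, so this Python version only shows the calculation; the function name, the `char_width_ratio` of 0.6 (a rough average glyph width as a fraction of font size), and the clamp limits are all assumptions rather than values from the site.

```python
def auto_fit_font_size(container_width_px, text, char_width_ratio=0.6,
                       min_px=10.0, max_px=28.0):
    """Largest font size (px) at which the longest line fits the container.

    char_width_ratio, min_px and max_px are assumed values for illustration.
    """
    longest = max(len(line) for line in text.split("\n"))
    if longest == 0:
        return max_px
    # Each character is assumed to take font_size * char_width_ratio pixels
    ideal = container_width_px / (longest * char_width_ratio)
    return max(min_px, min(max_px, ideal))
```

Under these assumptions, a 600 px container with a longest line of 50 characters gives 600 / (50 × 0.6) = 20 px; very long lines bottom out at `min_px` and short ones are capped at `max_px`.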
## Change Log
### Fix Poe parser and add font size controls
- Rewrote `extract_poe()` with section tracking (poem_sections vs non_poem_sections)
- Only extracts from 4 poetry sections, skips memoir/notes/prose/essays/dedications
- Result: 51 clean poems (was 108 with junk)
- Added `_is_title()`, `_save_current()`, `_norm()` helper functions
- Added skip_titles set for sub-headings (PREFACE, dedications, etc.)
- Renames "Part I/II" → "Al Aaraaf — Part I/II"
### Add poetry collection
- Created `download_poetry.py` with 15 Gutenberg extractors
- 3,098 poems from 15 sources stored in `poetry/` as JSON
- Created `/poetry` web page matching site theme
### Remove letter truncation
- Removed `truncate_letter()` — shows full letter text
### Restructure: pre-downloaded letters
- Moved from download-on-run to pre-parsed JSON in `letters/`
- Added 6 new sources (Browning Vol 1 and Vol 2, Burns, Osborne, Beethoven, Mozart)
- 1,307 letters from 11 sources
### Initial love letters app
- Created `love_letters.py` CLI with 5 initial sources
- `download_letters.py` for fetching/parsing Gutenberg texts