# CLAUDE.md — Letters & Poetry Project

## Project Overview

A Python project that downloads, parses, and displays historic love letters and classic poetry from Project Gutenberg. Data is pre-parsed and committed to git so end users don't need to download anything. A web UI is served via the `hicalsoft.github.io` GitHub Pages site.

## Repository Structure

```
├── download_letters.py       # Downloads & parses 11 letter sources from Gutenberg
├── download_poetry.py        # Downloads & parses 15 poetry sources from Gutenberg
├── love_letters.py           # CLI app: displays random letters in the terminal
├── letters/                  # Pre-parsed letter JSON files (11 sources, ~1,307 letters)
├── poetry/                   # Pre-parsed poetry JSON files (15 sources, ~3,098 poems)
├── hicalsoft.github.io/      # Embedded repo for GitHub Pages web UI (separate git history)
│   ├── letters/index.html    # Standalone SPA for browsing letters
│   ├── letters/data/letters.json
│   ├── poetry/index.html     # Standalone SPA for browsing poetry
│   └── poetry/data/poetry.json
└── README.md
```

## Key Commands

```bash
# Download all letter sources (requires internet)
python3 download_letters.py

# Download all poetry sources (requires internet)
python3 download_poetry.py

# Run CLI app (no internet needed; reads from the letters/ directory)
python3 love_letters.py                     # Random letter
python3 love_letters.py --list              # List sources
python3 love_letters.py --source napoleon   # Filter by source
```

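
As a rough illustration of how a CLI like this can run offline, the sketch below loads every JSON file under `letters/` and prints a random letter, optionally filtered by source. It is not the code in `love_letters.py`: the helper names, the output format, and the case-insensitive substring match used for `--source` are assumptions. Only the `author`, `recipient`, `body`, and `source` fields come from the project's letter records.

```python
import argparse
import glob
import json
import os
import random


def load_letters(directory="letters"):
    """Read every *.json file in `directory` into one flat list of letter dicts."""
    letters = []
    for path in sorted(glob.glob(os.path.join(directory, "*.json"))):
        with open(path, encoding="utf-8") as f:
            letters.extend(json.load(f))
    return letters


def filter_by_source(letters, query):
    """Assumed matching rule: case-insensitive substring of the `source` field."""
    needle = query.lower()
    return [item for item in letters if needle in item["source"].lower()]


def main(argv=None):
    parser = argparse.ArgumentParser(description="Show a random letter")
    parser.add_argument("--list", action="store_true", help="list sources")
    parser.add_argument("--source", help="only pick letters from this source")
    args = parser.parse_args(argv)

    letters = load_letters()
    if args.list:
        # Print each distinct source name once
        for name in sorted({item["source"] for item in letters}):
            print(name)
        return
    if args.source:
        letters = filter_by_source(letters, args.source)
    if not letters:
        print("No letters found.")
        return
    letter = random.choice(letters)
    print(f'{letter["author"]} to {letter["recipient"]}\n\n{letter["body"]}')
```

Under these assumptions, `main(["--source", "napoleon"])` would behave like the last command above.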
## Data Pipeline

1. **Download scripts** fetch raw `.txt` files from Project Gutenberg via `urllib`
2. Each source has a **custom extractor function** that parses the Gutenberg text format
3. Parsed data is saved as JSON to `letters/` or `poetry/` directories
4. A separate step combines the individual JSON files into `letters.json` / `poetry.json` for the web UI (these live in `hicalsoft.github.io/*/data/`)

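
Steps 1 to 3 can be sketched as three small functions. This is a minimal sketch, not the project's code: the URL pattern, the function names, the `LETTER <n>` heading format, and the output file layout are all assumptions, and the toy extractor only shows the general shape that each real per-source extractor fills in differently.

```python
import json
import os
import re
import urllib.request

# Assumed URL pattern for plain-text downloads; the project's scripts may use another.
GUTENBERG_URL = "https://www.gutenberg.org/cache/epub/{id}/pg{id}.txt"


def fetch_text(gutenberg_id):
    """Step 1: download the raw .txt for one book (needs internet)."""
    url = GUTENBERG_URL.format(id=gutenberg_id)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_toy(text, author, recipient, source):
    """Step 2: a toy extractor that splits on 'LETTER <n>' heading lines."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    parts = re.split(r"^(LETTER [IVXLC\d]+)\.?\s*$", text, flags=re.MULTILINE)
    letters = []
    # With one capture group, re.split yields [preamble, heading, body, heading, body, ...]
    for heading, body in zip(parts[1::2], parts[2::2]):
        letters.append({
            "author": author, "recipient": recipient, "heading": heading,
            "body": body.strip(), "source": source,
        })
    return letters


def save_json(records, directory, name):
    """Step 3: write the parsed records to <directory>/<name>.json."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name + ".json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return path
```

Keeping the download, extraction, and save steps separate means an extractor can be tested against a saved text file without touching the network.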
### Regenerating web UI data

```python
# Letters: run from project root
import glob
import json

out = {"authors": {}, "letters": []}
for path in sorted(glob.glob("letters/*.json")):
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for letter in data:
        # Short keys keep the combined file small
        out["letters"].append({
            "a": letter["author"], "r": letter["recipient"],
            "h": letter.get("heading", ""), "b": letter["body"],
            "s": letter["source"], "p": letter.get("period", ""),
        })
        # Count letters per author
        out["authors"][letter["author"]] = out["authors"].get(letter["author"], 0) + 1

with open("hicalsoft.github.io/letters/data/letters.json", "w", encoding="utf-8") as f:
    json.dump(out, f, separators=(",", ":"))

# Poetry: same pattern with {authors, poems} structure
```

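
The poetry file can be built the same way. The sketch below generalizes the letters script into one function driven by a field map. The poem field names (`title`, `body`, `source`) and the short keys (`a`, `t`, `b`, `s`) are assumptions, since the poetry schema is not spelled out here; check them against the files in `poetry/` and the keys the `/poetry` page reads before using this.

```python
import glob
import json


def combine(src_glob, list_key, field_map, out_path):
    """Merge per-source JSON files into one compact file for the web UI.

    field_map maps short output keys to (input_field, default) pairs;
    a default of None marks the field as required.
    """
    out = {"authors": {}, list_key: []}
    for path in sorted(glob.glob(src_glob)):
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
        for rec in records:
            item = {}
            for short, (field, default) in field_map.items():
                item[short] = rec[field] if default is None else rec.get(field, default)
            out[list_key].append(item)
            # Count entries per author
            out["authors"][rec["author"]] = out["authors"].get(rec["author"], 0) + 1
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(out, f, separators=(",", ":"))
    return out


# Hypothetical poem fields and short keys; adjust to the real poetry/*.json schema.
POEM_FIELDS = {
    "a": ("author", None),
    "t": ("title", ""),
    "b": ("body", None),
    "s": ("source", None),
}
```

From the project root, the call would look like `combine("poetry/*.json", "poems", POEM_FIELDS, "hicalsoft.github.io/poetry/data/poetry.json")`.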
## Gutenberg Parsing Notes

- **Line endings**: Always normalize with `.replace("\r\n", "\n").replace("\r", "\n")` before regex splitting
- **START/END markers** vary (`"*** START OF THE PROJECT GUTENBERG EBOOK"`, `"*** START OF THIS PROJECT GUTENBERG EBOOK"`, etc.), so match them with a regex rather than a fixed string
- **Each source needs a custom extractor** due to unique formatting (Roman numerals, ALL CAPS titles, numbered entries, etc.)
- **CONTENTS sections** often repeat the headings used in the actual content, so find the 2nd occurrence of a heading or verify its context
- **Poe** is the most complex: `extract_poe` uses section tracking (`poem_sections` vs `non_poem_sections`) to extract only from the 4 actual poetry sections, skipping memoir, notes, prose, and essays
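
The first, second, and fourth notes can be sketched as follows. The function names and exact regexes are illustrative assumptions, not the project's own helpers; the patterns cover the `THE`/`THIS` marker variants quoted above, and other variants may need additions.

```python
import re

# Match both "START OF THE ..." and "START OF THIS ..." marker lines (and the END twins).
START_RE = re.compile(r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)


def normalize_newlines(text):
    """Convert CRLF and bare CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def strip_gutenberg_wrapper(text):
    """Return only the text between the START and END marker lines."""
    text = normalize_newlines(text)
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    stop = end.start() if end else len(text)
    return text[begin:stop].strip()


def find_second_occurrence(text, heading):
    """Index of the 2nd line equal to `heading`; the 1st if it appears once; -1 if never."""
    pattern = r"^" + re.escape(heading) + r"[ \t]*$"
    starts = [m.start() for m in re.finditer(pattern, text, re.MULTILINE)]
    if not starts:
        return -1
    return starts[1] if len(starts) > 1 else starts[0]
```

Taking the second occurrence skips the copy of a heading listed under CONTENTS, so parsing starts at the real section instead of the table of contents.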
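
Section tracking of the kind used for Poe amounts to a small state machine: a flag records whether the current line sits inside a poetry section, and poems are collected only while it is set. The sketch below shows the idea only. The ALL CAPS title heuristic, the helper bodies, and the record shape are hypothetical, and the real section names must be taken from the source text.

```python
def is_title(line):
    """Toy heuristic: a short, non-empty ALL CAPS line counts as a title."""
    s = line.strip()
    return bool(s) and len(s) < 60 and s == s.upper() and any(c.isalpha() for c in s)


def extract_by_section(text, poem_sections, non_poem_sections, skip_titles=frozenset()):
    """Collect {"title", "body"} dicts, but only while inside a poem section."""
    poems = []
    in_poems = False  # True while the current line is inside a poetry section
    title, lines = None, []

    def save_current():
        # Store the poem gathered so far, if it has both a title and a body
        body = "\n".join(lines).strip()
        if title and body:
            poems.append({"title": title, "body": body})

    for line in text.split("\n"):
        key = line.strip()
        if key in poem_sections or key in non_poem_sections:
            # A section heading closes the current poem and flips the flag
            save_current()
            title, lines = None, []
            in_poems = key in poem_sections
        elif not in_poems:
            continue
        elif is_title(line):
            save_current()
            # Sub-headings such as PREFACE start no poem; their text is dropped
            title = None if key in skip_titles else key
            lines = []
        elif title is not None:
            lines.append(line)
    save_current()
    return poems
```

Listing the non-poem sections explicitly, instead of treating every unknown heading as a poem title, is what keeps memoir and notes text out of the results.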
## Letter Sources (11)

| Source | Gutenberg ID | Extractor |
|--------|-------------|-----------|
| Henry VIII to Anne Boleyn | 22009 | `extract_henry_viii` |
| Mary Wollstonecraft to Gilbert Imlay | 3529 | `extract_wollstonecraft` |
| Abelard & Héloïse | 35977 | `extract_abelard_heloise` |
| Napoleon to Josephine | 37499 | `extract_napoleon` |
| Keats to Fanny Brawne | 35698 | `extract_keats_brawne` |
| Browning Letters Vol 1 | 50400 | `_extract_browning` |
| Browning Letters Vol 2 | 51263 | `_extract_browning` |
| Burns to Clarinda | 6131 | `extract_burns_clarinda` |
| Dorothy Osborne | 34387 | `extract_dorothy_osborne` |
| Beethoven | 13065 | `extract_beethoven` |
| Mozart | 5307 | `extract_mozart` |

## Poetry Sources (15)

| Source | Gutenberg ID | Extractor |
|--------|-------------|-----------|
| Shakespeare Sonnets | 1041 | `extract_shakespeare_sonnets` |
| Emily Dickinson | 12242 | `extract_dickinson` |
| Walt Whitman | 1322 | `extract_whitman` |
| William Blake | 1934 | `extract_blake` |
| John Keats | 23684 | `extract_keats` |
| Edgar Allan Poe | 10031 | `extract_poe` |
| E.B. Browning Sonnets | 2002 | `extract_browning_sonnets` |
| T.S. Eliot | 1321 | `extract_eliot_wasteland` |
| Robert Frost (Mountain) | 29345 | `extract_frost_mountain` |
| Robert Frost (Selected) | 59824 | `extract_frost_selected` |
| W.B. Yeats | 32233 | `extract_yeats` |
| Omar Khayyám | 246 | `extract_khayyam` |
| Robert Burns | 1279 | `extract_burns` |
| William Wordsworth | 9622 | `extract_wordsworth` |
| Percy Shelley | 4800 | `extract_shelley` |

## Web UI Architecture

Both `/letters` and `/poetry` pages are **standalone SPAs** (no Jekyll dependency). Shared behavior:

- Load a single combined JSON file via `fetch()`
- Match the site's dark neumorphism theme (bg `#2b2d2f`, text `#fff`)
- Accent colors: red (`#ff073a`) for Letters, purple (`#c084fc`) for Poetry
- Features: author/poet sidebar, card grid, random button, detail view with font controls
- Keyboard nav: ←/→ arrows, Escape to go back, R for random
- Font auto-fit: calculates the ideal font size from the container width and the longest line length
- Manual A+/A− buttons override auto-fit; click the "auto" label to reset

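
The auto-fit rule can be written as a short formula: if an average glyph is about `ratio` times the font size wide, the longest line fits when `font_size * ratio * longest_line <= container_width`. The sketch below solves that for the font size and clamps the result. It is written in Python for consistency with the rest of the project (the pages themselves run in the browser), and the 0.6 width ratio and the 10 to 28 px bounds are assumptions, not values taken from the site.

```python
def auto_font_size(container_width_px, text, char_width_ratio=0.6, min_px=10.0, max_px=28.0):
    """Largest font size (px) at which the longest line still fits the container.

    char_width_ratio approximates average glyph width / font size; it and the
    clamp bounds are assumed values for illustration.
    """
    longest = max(len(line) for line in text.split("\n"))
    if longest == 0:
        return max_px
    ideal = container_width_px / (longest * char_width_ratio)
    return max(min_px, min(max_px, ideal))
```

For a 600 px container and a longest line of 50 characters, this gives 600 / (50 × 0.6) = 20 px; much longer lines fall back to the minimum size and wrap or scroll instead.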
## Change Log

### Fix Poe parser and add font size controls

- Rewrote `extract_poe()` with section tracking (`poem_sections` vs `non_poem_sections`)
- Only extracts from 4 poetry sections, skips memoir/notes/prose/essays/dedications
- Result: 51 clean poems (was 108 with junk)
- Added `_is_title()`, `_save_current()`, `_norm()` helper functions
- Added `skip_titles` set for sub-headings (PREFACE, dedications, etc.)
- Renames "Part I/II" → "Al Aaraaf — Part I/II"

### Add poetry collection

- Created `download_poetry.py` with 15 Gutenberg extractors
- 3,098 poems from 15 sources stored in `poetry/` as JSON
- Created `/poetry` web page matching site theme

### Remove letter truncation

- Removed `truncate_letter()`; letters are now shown in full

### Restructure: pre-downloaded letters

- Moved from download-on-run to pre-parsed JSON in `letters/`
- Added 6 new sources (Browning Vol 1 and Vol 2, Burns, Osborne, Beethoven, Mozart)
- 1,307 letters from 11 sources

### Initial love letters app

- Created `love_letters.py` CLI with 5 initial sources
- `download_letters.py` for fetching/parsing Gutenberg texts