Add CLAUDE.md project documentation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
main
Ibraheem Saleh 1 week ago
parent 49d11e8869
commit 6b308d3a01

@@ -0,0 +1,142 @@
# CLAUDE.md — Letters & Poetry Project
## Project Overview
A Python project that downloads, parses, and displays historic love letters and classic poetry from Project Gutenberg. Data is pre-parsed and committed to git so end users don't need to download anything. A web UI is served via the `hicalsoft.github.io` GitHub Pages site.
## Repository Structure
```
├── download_letters.py # Downloads & parses 11 letter sources from Gutenberg
├── download_poetry.py # Downloads & parses 15 poetry sources from Gutenberg
├── love_letters.py # CLI app: displays random letters in the terminal
├── letters/ # Pre-parsed letter JSON files (11 sources, ~1,307 letters)
├── poetry/ # Pre-parsed poetry JSON files (15 sources, ~3,098 poems)
├── hicalsoft.github.io/ # Embedded repo for GitHub Pages web UI (separate git history)
│ ├── letters/index.html # Standalone SPA for browsing letters
│ ├── letters/data/letters.json
│ ├── poetry/index.html # Standalone SPA for browsing poetry
│ └── poetry/data/poetry.json
└── README.md
```
## Key Commands
```bash
# Download all letter sources (requires internet)
python3 download_letters.py
# Download all poetry sources (requires internet)
python3 download_poetry.py
# Run CLI app (no internet needed — reads from letters/ directory)
python3 love_letters.py # Random letter
python3 love_letters.py --list # List sources
python3 love_letters.py --source napoleon # Filter by source
```
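The flags above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `love_letters.py`: the function names, the case-insensitive substring match on the `source` field, and the output format are all assumptions; only the `--list` and `--source` flags, the `letters/` directory, and the record fields come from this document.

```python
import argparse
import json
import random
from pathlib import Path


def load_letters(letters_dir="letters"):
    """Read every pre-parsed JSON file in letters_dir into one list of dicts."""
    letters = []
    for path in sorted(Path(letters_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            letters.extend(json.load(f))
    return letters


def filter_by_source(letters, keyword):
    """Keep letters whose 'source' field contains keyword (assumed matching rule)."""
    keyword = keyword.lower()
    return [letter for letter in letters if keyword in letter["source"].lower()]


def build_parser():
    parser = argparse.ArgumentParser(description="Show a random love letter")
    parser.add_argument("--list", action="store_true", help="list available sources")
    parser.add_argument("--source", help="only pick letters from matching sources")
    return parser


def main(argv=None, letters_dir="letters"):
    args = build_parser().parse_args(argv)
    letters = load_letters(letters_dir)
    if args.list:
        # Print each distinct source once, alphabetically.
        for source in sorted({letter["source"] for letter in letters}):
            print(source)
        return
    if args.source:
        letters = filter_by_source(letters, args.source)
    if not letters:
        print("No letters found.")
        return
    letter = random.choice(letters)
    print(f"{letter['author']} to {letter['recipient']}\n\n{letter['body']}")
```

Calling `main(["--source", "napoleon"])` from the project root would then print one random matching letter, with no network access involved.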
## Data Pipeline
1. **Download scripts** fetch raw `.txt` files from Project Gutenberg via `urllib`
2. Each source has a **custom extractor function** that parses the Gutenberg text format
3. Parsed data is saved as JSON to `letters/` or `poetry/` directories
4. A separate step combines the individual JSON files into `letters.json` / `poetry.json` for the web UI (these live in `hicalsoft.github.io/*/data/`)
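Steps 1 to 3 can be sketched as below. This is a hedged outline rather than the code in `download_letters.py`: `GUTENBERG_URL`, `fetch_text`, and `run_source` are hypothetical names, and the URL pattern is an assumption about where the plain-text files live. The `fetch` parameter exists only so the flow can be exercised without network access.

```python
import json
import urllib.request
from pathlib import Path

# Assumed URL pattern for plain-text downloads; the real scripts may use another.
GUTENBERG_URL = "https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"


def fetch_text(book_id, timeout=30):
    """Step 1: download the raw .txt for one Gutenberg ID via urllib."""
    url = GUTENBERG_URL.format(book_id=book_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def run_source(book_id, extractor, out_path, fetch=fetch_text):
    """Steps 2-3: parse with the source's extractor, then save the records as JSON."""
    records = extractor(fetch(book_id))
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return len(records)
```

A call such as `run_source(37499, extract_napoleon, "letters/napoleon.json")` would tie one row of the source table to one output file; the output filename here is illustrative.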
### Regenerating web UI data
```python
# Letters: run from project root
import glob
import json

out = {"authors": {}, "letters": []}
for path in sorted(glob.glob("letters/*.json")):
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for l in data:
        # Short keys keep the combined file small for the web UI
        out["letters"].append({"a": l["author"], "r": l["recipient"], "h": l.get("heading", ""), "b": l["body"], "s": l["source"], "p": l.get("period", "")})
        out["authors"].setdefault(l["author"], 0)
        out["authors"][l["author"]] += 1
with open("hicalsoft.github.io/letters/data/letters.json", "w", encoding="utf-8") as f:
    json.dump(out, f, separators=(",", ":"))
# Poetry: same pattern with {authors, poems} structure
```
## Gutenberg Parsing Notes
- **Line endings**: Always normalize with `.replace("\r\n", "\n").replace("\r", "\n")` before regex splitting
- **START/END markers** vary: `"*** START OF THE PROJECT GUTENBERG EBOOK"`, `"*** START OF THIS PROJECT GUTENBERG EBOOK"`, etc. — use regex
- **Each source needs a custom extractor** due to unique formatting (Roman numerals, ALL CAPS titles, numbered entries, etc.)
- **CONTENTS sections** often repeat the same headings as the actual content, so a naive search hits the table of contents first; take the 2nd occurrence or verify the surrounding context
- **Poe** is the most complex: uses section tracking (poem_sections vs non_poem_sections) to only extract from the 4 actual poetry sections, skipping memoir, notes, prose, essays
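The first, second, and fourth notes above can be sketched as small helpers. The names `normalize_newlines`, `strip_gutenberg_wrapper`, and `find_heading_after_contents` are hypothetical and not taken from the download scripts. The regexes cover only the two START variants quoted above plus the matching END forms, so other variants would need the pattern extended.

```python
import re

# Matches "*** START OF THE/THIS PROJECT GUTENBERG EBOOK <title> ***" lines.
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def normalize_newlines(text):
    """Convert CRLF and bare CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def strip_gutenberg_wrapper(text):
    """Return the text between the START and END markers (whichever are found)."""
    text = normalize_newlines(text)
    start = START_RE.search(text)
    if start:
        text = text[start.end():]
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]
    return text.strip("\n")


def find_heading_after_contents(text, heading):
    """Offset of the 2nd line equal to heading, else the 1st, else -1."""
    pattern = re.compile(
        r"^[ \t]*" + re.escape(heading) + r"[ \t]*$", re.MULTILINE
    )
    matches = [m.start() for m in pattern.finditer(text)]
    if not matches:
        return -1
    return matches[1] if len(matches) > 1 else matches[0]
```

Taking the second occurrence is only a heuristic: a source whose headings appear once, or more than twice, still needs the context check mentioned above.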
## Letter Sources (11)
| Source | Gutenberg ID | Extractor |
|--------|-------------|-----------|
| Henry VIII to Anne Boleyn | 22009 | `extract_henry_viii` |
| Mary Wollstonecraft to Gilbert Imlay | 3529 | `extract_wollstonecraft` |
| Abelard & Héloïse | 35977 | `extract_abelard_heloise` |
| Napoleon to Josephine | 37499 | `extract_napoleon` |
| Keats to Fanny Brawne | 35698 | `extract_keats_brawne` |
| Browning Letters Vol 1 | 50400 | `_extract_browning` |
| Browning Letters Vol 2 | 51263 | `_extract_browning` |
| Burns to Clarinda | 6131 | `extract_burns_clarinda` |
| Dorothy Osborne | 34387 | `extract_dorothy_osborne` |
| Beethoven | 13065 | `extract_beethoven` |
| Mozart | 5307 | `extract_mozart` |
## Poetry Sources (15)
| Source | Gutenberg ID | Extractor |
|--------|-------------|-----------|
| Shakespeare Sonnets | 1041 | `extract_shakespeare_sonnets` |
| Emily Dickinson | 12242 | `extract_dickinson` |
| Walt Whitman | 1322 | `extract_whitman` |
| William Blake | 1934 | `extract_blake` |
| John Keats | 23684 | `extract_keats` |
| Edgar Allan Poe | 10031 | `extract_poe` |
| E.B. Browning Sonnets | 2002 | `extract_browning_sonnets` |
| T.S. Eliot | 1321 | `extract_eliot_wasteland` |
| Robert Frost (Mountain) | 29345 | `extract_frost_mountain` |
| Robert Frost (Selected) | 59824 | `extract_frost_selected` |
| W.B. Yeats | 32233 | `extract_yeats` |
| Omar Khayyám | 246 | `extract_khayyam` |
| Robert Burns | 1279 | `extract_burns` |
| William Wordsworth | 9622 | `extract_wordsworth` |
| Percy Shelley | 4800 | `extract_shelley` |
## Web UI Architecture
Both `/letters` and `/poetry` pages are **standalone SPAs** (no Jekyll dependency). They:
- Load a single combined JSON file via `fetch()`
- Match the site's dark neumorphism theme (bg `#2b2d2f`, text `#fff`)
- Letters uses red accent (`#ff073a`), Poetry uses purple accent (`#c084fc`)
- Feature an author/poet sidebar, card grid, random button, and detail view with font controls
- Keyboard nav: ←/→ arrows, Escape to go back, R for random
- Font auto-fit: calculates ideal font size from container width and longest line length
- Manual A+/A- buttons override auto; click "auto" label to reset
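The font auto-fit arithmetic can be illustrated as follows. The real logic runs in the browser; Python is used here only to show the calculation, and every parameter (`char_width_ratio`, `min_px`, `max_px`) is an assumed value rather than one taken from the pages.

```python
def auto_font_size(container_width_px, longest_line_chars,
                   char_width_ratio=0.6, min_px=10.0, max_px=22.0):
    """Largest font size (px) at which the longest line fits the container.

    char_width_ratio is the assumed average glyph width as a fraction of the
    font size; min_px and max_px clamp the result to a readable range.
    """
    if longest_line_chars <= 0:
        return max_px
    # A line of N characters is roughly N * ratio * font_size pixels wide,
    # so solve container_width = N * ratio * font_size for font_size.
    ideal = container_width_px / (longest_line_chars * char_width_ratio)
    return max(min_px, min(max_px, ideal))
```

Under these assumptions a long prose line pushes the size down toward `min_px`, while short verse lines are capped at `max_px`.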
## Change Log
### Fix Poe parser and add font size controls
- Rewrote `extract_poe()` with section tracking (poem_sections vs non_poem_sections)
- Only extracts from 4 poetry sections, skips memoir/notes/prose/essays/dedications
- Result: 51 clean poems (was 108 with junk)
- Added `_is_title()`, `_save_current()`, `_norm()` helper functions
- Added skip_titles set for sub-headings (PREFACE, dedications, etc.)
- Renames "Part I/II" → "Al Aaraaf — Part I/II"
### Add poetry collection
- Created `download_poetry.py` with 15 Gutenberg extractors
- 3,098 poems from 15 sources stored in `poetry/` as JSON
- Created `/poetry` web page matching site theme
### Remove letter truncation
- Removed `truncate_letter()` — shows full letter text
### Restructure: pre-downloaded letters
- Moved from download-on-run to pre-parsed JSON in `letters/`
- Added 6 new sources (Browning Vol 1 and Vol 2, Burns, Osborne, Beethoven, Mozart)
- 1,307 letters from 11 sources
### Initial love letters app
- Created `love_letters.py` CLI with 5 initial sources
- `download_letters.py` for fetching/parsing Gutenberg texts

@@ -1 +1 @@
-Subproject commit 7fd16c08b20d81da497d2efb44af2e83860382f4
+Subproject commit c292f8eed1677afa7de015a8a32098f7b3f52956