Mike McRae

Economist • Media • IO • Political Economy • AI

Projects

ML Pipeline

Southern Newswires - Extracting Historical Wire Articles

An open-source pipeline that digitises historical newspapers, reconstructs article layouts, and classifies wire services across millions of documents. The output dataset supports large-scale media-history research in the U.S. South (1960-1975).

Layout parsing: YOLOv10 detection + rule-based bounding-box merges.
Duplication ID: Near-duplicate detection to collapse multi-paper wire repeats.
Newswire ID: Fine-tuned BERT models for AP/UPI/NEA attribution.
LLM correction: Minimal rewriting via Llama 3.2 to repair OCR errors.
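
The duplication step above can be illustrated with a small sketch. The page does not say which near-duplicate method the pipeline uses, so everything here is an assumption for illustration: word-level shingles, Jaccard similarity, a union-find pass to group matches, and the hypothetical names `shingles`, `jaccard`, `cluster_near_duplicates`, and `threshold`.

```python
from itertools import combinations


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles of a lowercased, whitespace-split text."""
    words = text.lower().split()
    if len(words) < k:
        # Very short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets score 0.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_near_duplicates(articles: list[str], threshold: float = 0.5) -> list[list[int]]:
    """Group article indices whose shingle sets overlap at or above `threshold`.

    Assumption: a plain pairwise comparison, which is quadratic in the number
    of articles and only suitable for small batches.
    """
    parent = list(range(len(articles)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    sets = [shingles(text) for text in articles]
    for i, j in combinations(range(len(articles)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                # Keep the smaller index as the cluster representative.
                parent[max(ri, rj)] = min(ri, rj)

    groups: dict[int, list[int]] = {}
    for i in range(len(articles)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    # Made-up example texts, not drawn from the dataset.
    sample = [
        "WASHINGTON (AP) The Senate voted Tuesday to extend the farm bill for two years",
        "WASHINGTON (AP) The Senate voted Tuesday to extend the farm bill for two more years",
        "Local high school band wins the state marching championship in Jackson",
    ]
    print(cluster_near_duplicates(sample))
```

The pairwise loop is for readability only. At the scale of millions of articles, an approximate scheme such as MinHash with locality-sensitive hashing would be the more usual choice, since it avoids comparing every pair.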
Dataset snapshot
Total articles: 58M
Wire articles: 9.57M
Coverage: 1960-1975
Figures: Newswire proportions; Topics over time