Projects
ML Pipeline
Southern Newswires - Extracting Historical Wire Articles
An open-source pipeline that digitises historical newspapers, reconstructs article layouts, and classifies wire services across millions of documents. The output dataset supports large-scale media-history research in the U.S. South (1960-1975).
Layout parsing: YOLOv10 detection + rule-based bounding box merges.
Duplication ID: Near-duplicate detection to collapse multi-paper wire repeats.
Newswire ID: Fine-tuned BERT models for AP/UPI/NEA attribution.
LLM correction: Minimal rewriting via Llama 3.2 to repair OCR errors.
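To make the layout-parsing step concrete, here is a minimal sketch of what a rule-based bounding box merge could look like. The pipeline's actual rules are not listed on this page, so everything here is an assumption for illustration: the rule (merge boxes that share most of a column's width and sit close together vertically), the function names, and the thresholds `min_overlap` and `max_gap` are all hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left


def horizontal_overlap(a: Box, b: Box) -> float:
    """Fraction of the narrower box's width that both boxes share."""
    shared = min(a[2], b[2]) - max(a[0], b[0])
    narrower = min(a[2] - a[0], b[2] - b[0])
    return max(shared, 0.0) / narrower if narrower > 0 else 0.0


def vertical_gap(a: Box, b: Box) -> float:
    """Vertical distance between two boxes (0 if they overlap vertically)."""
    return max(0.0, max(a[1], b[1]) - min(a[3], b[3]))


def merge_boxes(boxes: List[Box], min_overlap: float = 0.8,
                max_gap: float = 15.0) -> List[Box]:
    """Repeatedly merge same-column boxes until no pair satisfies the rule.

    Hypothetical rule: two detections belong to one article block when they
    overlap horizontally by at least `min_overlap` and are separated by at
    most `max_gap` pixels vertically.
    """
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                a, b = merged[i], merged[j]
                if (horizontal_overlap(a, b) >= min_overlap
                        and vertical_gap(a, b) <= max_gap):
                    # Replace the pair with their enclosing rectangle.
                    merged[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged


detections = [(0, 0, 100, 50), (2, 55, 98, 120), (200, 0, 300, 50)]
print(merge_boxes(detections))
```

In this toy input the first two detections share a column and sit 5 pixels apart, so they collapse into one block, while the third box in a different column is left alone.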
Dataset snapshot: 9.57M wires
Total articles: 58M
Wire articles: 9.57M
Coverage: 1960-1975
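The Duplication ID step can be illustrated with a small near-duplicate sketch. This page does not say which method the pipeline uses, so the example swaps in a common textbook approach: word-shingle Jaccard similarity with union-find grouping. The function names, the shingle size `k`, and the `threshold` value are assumptions, and the pairwise loop is quadratic, so it would not scale to millions of articles without an approximate index such as MinHash with LSH.

```python
import re
from typing import Dict, List, Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Set of overlapping k-word windows from lowercased, punctuation-free text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_near_duplicates(texts: List[str], threshold: float = 0.5,
                            k: int = 3) -> List[List[int]]:
    """Group article indices whose shingle sets are at least `threshold` similar."""
    sets = [shingles(t, k) for t in texts]
    parent = list(range(len(texts)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if jaccard(sets[i], sets[j]) >= threshold:
                parent[find(j)] = find(i)

    groups: Dict[int, List[int]] = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


articles = [
    "The senate passed the farm bill on Tuesday after a long debate",
    "The Senate passed the farm bill on Tuesday after a long debatc",  # OCR slip
    "Local school board approves new budget for the coming year",
]
print(cluster_near_duplicates(articles))
```

The second article differs from the first only by case and one OCR error, so the two land in the same group, which is the behaviour needed to collapse a wire story reprinted across several papers.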
Semantic Matcher
TRISS Researcher Similarity
A discovery suite that maps who is working on similar problems inside TRISS. The semantic matcher compares researcher-level abstract embeddings, then surfaces the closest neighbours along with the top publications that explain the overlap.
Similarity engine: Offline cosine similarity over consolidated abstract embeddings.
Explainability: Top-5 publication matches reveal the shared themes.
UMAP explorer: Interactive 2D topic maps for TRISS research outputs.
Semantic search (live preview)
Search by name or email: Mike McRae
Matching 327 TRISS researchers
Ciara O'Neil, Education Policy: 92.4%
Tommy Walsh, EdTech & Inclusion: 89.7%
Sara Byrne, Mathematics Education: 87.5%
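The ranking behind the semantic matcher can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's actual code: "consolidated" is taken to mean averaging each researcher's abstract embeddings into one vector, the researcher IDs and two-dimensional vectors are toy placeholders, and `mean_vector`, `cosine`, and `closest_researchers` are hypothetical names.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Average several abstract embeddings into one researcher-level vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_researchers(query: str, profiles: Dict[str, Vector],
                        top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank every other researcher by cosine similarity to `query`."""
    scores = [(name, cosine(profiles[query], vec))
              for name, vec in profiles.items() if name != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy abstract embeddings keyed by placeholder researcher IDs.
abstracts = {
    "researcher_a": [[1.0, 0.0], [1.0, 0.0]],
    "researcher_b": [[1.0, 0.0], [0.0, 1.0]],
    "researcher_c": [[0.0, 1.0]],
}
profiles = {name: mean_vector(vecs) for name, vecs in abstracts.items()}

for name, score in closest_researchers("researcher_a", profiles):
    print(f"{name}: {score:.1%}")
```

Because the similarities are computed offline, a production version would typically precompute the full score matrix once and serve the ranked neighbours from a lookup table. The same `cosine` function could be applied at the publication level to pick the top matches that explain an overlap.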