# Phase 1.5: Content Enrichment

**Module:** `src/content_enricher.py`
**Estimated time:** ~1 day
## Objective
Fill in missing abstracts and obtain full-text content for key papers using three complementary strategies.
## Enrichment Strategies
The system applies strategies in order, stopping once a paper has sufficient content:
### Strategy A: Semantic Scholar API
- Fetches `abstract` and `TLDR` (AI-generated summary) for each paper by DOI
- Rate-limited to 1 request per second
- Responses cached to `output/openalex_cache/semantic_scholar/`
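The Strategy A flow above can be sketched as a fetch-with-cache helper. The endpoint and field names (`abstract`, `tldr`) come from Semantic Scholar's Graph API; the `S2_FIELDS` selection, the cache filename scheme, and the helper names are illustrative assumptions, not the module's actual code:

```python
import json
import time
from pathlib import Path
from urllib import request

S2_FIELDS = "title,abstract,tldr"  # fields to request per paper (assumed selection)
CACHE_DIR = Path("output/openalex_cache/semantic_scholar")

def s2_url(doi: str) -> str:
    """Build the Semantic Scholar Graph API paper-lookup URL for a DOI."""
    return f"https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}?fields={S2_FIELDS}"

def cache_path(doi: str) -> Path:
    """One JSON file per DOI; slashes in DOIs are not valid in filenames."""
    return CACHE_DIR / (doi.replace("/", "_") + ".json")

def fetch_s2(doi: str) -> dict:
    """Return the cached response if present, else fetch, cache, and rate-limit."""
    path = cache_path(doi)
    if path.exists():
        return json.loads(path.read_text())
    with request.urlopen(s2_url(doi)) as resp:
        data = json.loads(resp.read())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data))
    time.sleep(1.0)  # crude 1 request/second limit, as stated above
    return data
```

Cached papers skip both the network call and the sleep, so re-runs over an already-enriched corpus are fast.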
### Strategy B: Open Access PDF Download
- Query the Unpaywall API to find OA PDF URLs
- Download the PDF files
- Extract text using PyMuPDF (with pdfplumber as a fallback)
- Cache PDFs and extracted text to `output/openalex_cache/fulltext/`
### Strategy C: CORE API (Placeholder)
Reserved for future integration with the CORE API for additional full-text access.
## Data Flow
```mermaid
flowchart LR
    PUB[publications.parquet] --> E{Enrich}
    E -->|Strategy A| SS[Semantic Scholar API]
    E -->|Strategy B| UP[Unpaywall → PDF → Text]
    E -->|Strategy C| CORE[CORE API]
    SS --> OUT[Updated publications.parquet]
    UP --> OUT
    CORE --> OUT
```
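The "apply strategies in order, stop once sufficient" rule from the top of this section can be sketched as a small driver loop. The `MIN_ABSTRACT_CHARS` threshold is an assumed stand-in for "sufficient content", and the strategies are passed in as plain callables rather than the module's real functions:

```python
MIN_ABSTRACT_CHARS = 200  # assumed threshold for "sufficient content"

def has_sufficient_content(paper: dict) -> bool:
    """A paper is done once it has a long-enough abstract or any full text."""
    abstract = paper.get("abstract_enriched") or ""
    return len(abstract) >= MIN_ABSTRACT_CHARS or bool(paper.get("fulltext"))

def enrich(paper: dict, strategies) -> dict:
    """Apply each (name, fn) strategy in order, stopping early when satisfied."""
    for name, fn in strategies:
        if has_sufficient_content(paper):
            break  # later strategies are skipped entirely
        update = fn(paper)
        if update:
            paper.update(update)
            paper["content_source"] = name
    return paper
```

Because the loop breaks before invoking a strategy, a paper that already arrives with a good abstract never triggers an API call at all.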
## Added Columns
After enrichment, the following columns are added to `publications.parquet`:

| Column | Type | Description |
|---|---|---|
| `abstract_enriched` | string | Best available abstract (original or from S2) |
| `tldr` | string | AI-generated one-line summary from Semantic Scholar |
| `fulltext` | string | Full paper text extracted from PDF |
| `content_source` | string | Source of enriched content (`original`, `semantic_scholar`, `oa_pdf`) |
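A sketch of how two of these columns might be derived in pandas. The original-first fallback and the source labels follow the table above, but the column names of the raw inputs (`abstract`, `s2_abstract`) are assumptions:

```python
import pandas as pd

# Toy frame: one paper with an original abstract, one with only an S2 abstract.
df = pd.DataFrame({
    "abstract": ["original text", None],
    "s2_abstract": [None, "abstract from S2"],
})

# abstract_enriched prefers the original abstract, falling back to S2.
df["abstract_enriched"] = df["abstract"].fillna(df["s2_abstract"])

# content_source records where the winning abstract came from.
df["content_source"] = df["abstract"].notna().map(
    {True: "original", False: "semantic_scholar"}
)
```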
## Caching
All API responses and downloaded files are cached to avoid redundant requests:
```
output/openalex_cache/
├── semantic_scholar/   # S2 API JSON responses
├── unpaywall/          # Unpaywall API JSON responses
├── pdfs/               # Downloaded PDF files
└── fulltext/           # Extracted text JSON files
```