# Phase 1.5: Content Enrichment

**Module:** `src/content_enricher.py`
**Estimated time:** ~1 day
## Objective
Fill in missing abstracts and obtain full-text content for key papers using three complementary strategies.
## Enrichment Strategies
The system applies strategies in order, stopping once a paper has sufficient content:
### Strategy A: Semantic Scholar API
- Fetches `abstract` and `TLDR` (AI-generated summary) for each paper by DOI
- Rate-limited to 1 request per second
- Responses cached to `output/openalex_cache/semantic_scholar/`
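The Strategy A flow above can be sketched as a fetch-with-cache helper. The endpoint and field names (`abstract`, `tldr`) come from Semantic Scholar's Graph API; the `S2_FIELDS` selection, the cache filename scheme, and the helper names are illustrative assumptions, not the module's actual code:

```python
import json
import time
from pathlib import Path
from urllib import request

S2_FIELDS = "title,abstract,tldr"  # fields to request per paper (assumed selection)
CACHE_DIR = Path("output/openalex_cache/semantic_scholar")

def s2_url(doi: str) -> str:
    """Build the Semantic Scholar Graph API paper-lookup URL for a DOI."""
    return f"https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}?fields={S2_FIELDS}"

def cache_path(doi: str) -> Path:
    """One JSON file per DOI; slashes in DOIs are not valid in filenames."""
    return CACHE_DIR / (doi.replace("/", "_") + ".json")

def fetch_s2(doi: str) -> dict:
    """Return the cached response if present, else fetch, cache, and rate-limit."""
    path = cache_path(doi)
    if path.exists():
        return json.loads(path.read_text())
    with request.urlopen(s2_url(doi)) as resp:
        data = json.loads(resp.read())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data))
    time.sleep(1.0)  # crude 1 request/second limit, as stated above
    return data
```

Cached papers skip both the network call and the sleep, so re-runs over an already-enriched corpus are fast.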
### Strategy B: Open Access PDF Download
- Query the Unpaywall API to find OA PDF URLs
- Download the PDF files
- Extract text using PyMuPDF (with pdfplumber as a fallback)
- Cache PDFs and extracted text to `output/openalex_cache/fulltext/`
### Strategy C: CORE API (Placeholder)
Reserved for future integration with the CORE API for additional full-text access.
## Data Flow
```mermaid
flowchart LR
    PUB[publications.parquet] --> E{Enrich}
    E -->|Strategy A| SS[Semantic Scholar API]
    E -->|Strategy B| UP[Unpaywall → PDF → Text]
    E -->|Strategy C| CORE[CORE API]
    SS --> OUT[Updated publications.parquet]
    UP --> OUT
    CORE --> OUT
```
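The "apply strategies in order, stop once sufficient" rule from the top of this section can be sketched as a small driver loop. The `MIN_ABSTRACT_CHARS` threshold is an assumed stand-in for "sufficient content", and the strategies are passed in as plain callables rather than the module's real functions:

```python
MIN_ABSTRACT_CHARS = 200  # assumed threshold for "sufficient content"

def has_sufficient_content(paper: dict) -> bool:
    """A paper is done once it has a long-enough abstract or any full text."""
    abstract = paper.get("abstract_enriched") or ""
    return len(abstract) >= MIN_ABSTRACT_CHARS or bool(paper.get("fulltext"))

def enrich(paper: dict, strategies) -> dict:
    """Apply each (name, fn) strategy in order, stopping early when satisfied."""
    for name, fn in strategies:
        if has_sufficient_content(paper):
            break  # later strategies are skipped entirely
        update = fn(paper)
        if update:
            paper.update(update)
            paper["content_source"] = name
    return paper
```

Because the loop breaks before invoking a strategy, a paper that already arrives with a good abstract never triggers an API call at all.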
## Added Columns
After enrichment, the following columns are added to `publications.parquet`:

| Column | Type | Description |
|---|---|---|
| `abstract_enriched` | string | Best available abstract (original or from S2) |
| `tldr` | string | AI-generated one-line summary from Semantic Scholar |
| `fulltext` | string | Full paper text extracted from PDF |
| `content_source` | string | Source of enriched content (`original`, `semantic_scholar`, `oa_pdf`) |
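A sketch of how two of these columns might be derived in pandas. The original-first fallback and the source labels follow the table above, but the column names of the raw inputs (`abstract`, `s2_abstract`) are assumptions:

```python
import pandas as pd

# Toy frame: one paper with an original abstract, one with only an S2 abstract.
df = pd.DataFrame({
    "abstract": ["original text", None],
    "s2_abstract": [None, "abstract from S2"],
})

# abstract_enriched prefers the original abstract, falling back to S2.
df["abstract_enriched"] = df["abstract"].fillna(df["s2_abstract"])

# content_source records where the winning abstract came from.
df["content_source"] = df["abstract"].notna().map(
    {True: "original", False: "semantic_scholar"}
)
```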
## Caching
All API responses and downloaded files are cached to avoid redundant requests:
```
output/openalex_cache/
├── semantic_scholar/   # S2 API JSON responses
├── unpaywall/          # Unpaywall API JSON responses
├── pdfs/               # Downloaded PDF files
└── fulltext/           # Extracted text JSON files
```