Phase 1.5: Content Enrichment

Module: src/content_enricher.py
Estimated Time: ~1 day

Objective

Fill in missing abstracts and obtain full-text content for key papers using complementary strategies: two are implemented, and a third is reserved as a placeholder.

Enrichment Strategies

The system applies the strategies in order (A, then B, then C) and stops once a paper has sufficient content.

Strategy A: Semantic Scholar API

  • Fetches abstract and TLDR (AI-generated summary) for each paper by DOI
  • Rate-limited to 1 request per second
  • Responses cached to output/openalex_cache/semantic_scholar/
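As a rough illustration of Strategy A, the sketch below spaces requests at least one second apart and reads the abstract and TLDR from a Semantic Scholar Graph API response. The names `RateLimiter` and `fetch_s2` are hypothetical and not taken from src/content_enricher.py; the endpoint and the `abstract,tldr` field list follow the public Graph API, where `tldr` is returned as an object with a `text` field. Caching is left out here to keep the example short.

```python
import json
import time
import urllib.parse
import urllib.request

# Public Semantic Scholar Graph API endpoint for looking a paper up by DOI.
S2_URL = "https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}?fields=abstract,tldr"


class RateLimiter:
    """Hypothetical helper: keeps successive calls at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_s2(doi, limiter, get=urllib.request.urlopen):
    """Return {'abstract': ..., 'tldr': ...} for a DOI (values may be None)."""
    limiter.wait()
    url = S2_URL.format(doi=urllib.parse.quote(doi, safe="/"))
    with get(url, timeout=30) as resp:
        data = json.load(resp)
    # `tldr` is either null or an object such as {"model": ..., "text": ...}.
    tldr = data.get("tldr") or {}
    return {"abstract": data.get("abstract"), "tldr": tldr.get("text")}
```

With the default `get`, an unknown DOI surfaces as `urllib.error.HTTPError`, so a caller would likely want to catch that and move on to the next strategy.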

Strategy B: Open Access PDF Download

  1. Query Unpaywall API to find OA PDF URLs
  2. Download PDF files
  3. Extract text using PyMuPDF (with pdfplumber as fallback)
  4. Cache PDFs to output/openalex_cache/pdfs/ and extracted text to output/openalex_cache/fulltext/
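The steps above might look roughly like the following sketch, which covers picking a PDF URL from an Unpaywall record and extracting text with PyMuPDF first and pdfplumber second. The function names are hypothetical; `best_oa_location`, `oa_locations`, and `url_for_pdf` are fields of the Unpaywall v2 response, and the extraction calls (`fitz.open`, `page.get_text`, `pdfplumber.open`, `page.extract_text`) are the libraries' documented entry points. Treating an empty result as a failure is an assumption of this sketch.

```python
from typing import Callable, Optional, Sequence


def pick_pdf_url(record: dict) -> Optional[str]:
    """Return a PDF URL from an Unpaywall record, or None if no OA PDF is listed."""
    best = record.get("best_oa_location") or {}
    if best.get("url_for_pdf"):
        return best["url_for_pdf"]
    for location in record.get("oa_locations") or []:
        if location.get("url_for_pdf"):
            return location["url_for_pdf"]
    return None


def extract_with_pymupdf(pdf_path: str) -> str:
    import fitz  # PyMuPDF, imported lazily so the fallback still works without it

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_with_pdfplumber(pdf_path: str) -> str:
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_text(
    pdf_path: str,
    extractors: Sequence[Callable[[str], str]] = (extract_with_pymupdf, extract_with_pdfplumber),
) -> str:
    """Try each extractor in order; return the first non-empty result, else ''."""
    for extractor in extractors:
        try:
            text = extractor(pdf_path)
        except Exception:
            # Covers a missing library as well as a PDF the library cannot parse.
            continue
        if text.strip():
            return text
    return ""
```

Passing the extractors as a parameter keeps the fallback order explicit and makes it easy to swap in another library later.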

Strategy C: CORE API (Placeholder)

Reserved for future integration with the CORE API for additional full-text access.
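The ordering described in this section can be pictured as a simple cascade. The sketch below is illustrative only: `enrich`, `Strategy`, and `has_abstract` are hypothetical names, and because the page does not define what counts as "sufficient content", the check is passed in as a parameter, with a non-empty abstract as a stand-in default.

```python
from typing import Callable, Optional, Sequence

# Hypothetical shape: a strategy receives the paper record and returns new fields, or None.
Strategy = Callable[[dict], Optional[dict]]


def has_abstract(paper: dict) -> bool:
    # Assumption: stand-in for the real sufficiency rule, which is not documented here.
    return bool(paper.get("abstract_enriched"))


def enrich(
    paper: dict,
    strategies: Sequence[Strategy],
    is_sufficient: Callable[[dict], bool] = has_abstract,
) -> dict:
    """Apply strategies in order and stop as soon as the paper is sufficient."""
    result = dict(paper)  # leave the caller's record untouched
    for strategy in strategies:
        if is_sufficient(result):
            break
        update = strategy(result)
        if update:
            result.update(update)
    return result
```

A stricter `is_sufficient`, for example one that also requires `fulltext`, would let Strategy B run even when Strategy A has already supplied an abstract.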

Data Flow

flowchart LR
    PUB[publications.parquet] --> E{Enrich}
    E -->|Strategy A| SS[Semantic Scholar API]
    E -->|Strategy B| UP[Unpaywall → PDF → Text]
    E -->|Strategy C| CORE[CORE API]
    SS --> OUT[Updated publications.parquet]
    UP --> OUT
    CORE --> OUT

Added Columns

After enrichment, the following columns are added to publications.parquet:

| Column | Type | Description |
|---|---|---|
| abstract_enriched | string | Best available abstract (original or from S2) |
| tldr | string | AI-generated one-line summary from Semantic Scholar |
| fulltext | string | Full paper text extracted from PDF |
| content_source | string | Source of enriched content (original, semantic_scholar, oa_pdf) |
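To make the column semantics concrete, here is a minimal sketch of how one row's values could be derived. `build_enrichment_columns` is a hypothetical helper, and the precedence rules are assumptions: the original abstract is preferred over the Semantic Scholar one, and `content_source` reports `oa_pdf` whenever full text is present. The actual module may decide differently.

```python
from typing import Optional


def build_enrichment_columns(
    original_abstract: Optional[str],
    s2_response: Optional[dict] = None,
    fulltext: Optional[str] = None,
) -> dict:
    """Derive the four added columns for one paper (assumed precedence rules)."""
    s2 = s2_response or {}
    s2_abstract = s2.get("abstract")
    # Semantic Scholar returns `tldr` as an object; the column keeps only its text.
    tldr = (s2.get("tldr") or {}).get("text")

    # Assumption: keep the original abstract when it exists, else fall back to S2.
    abstract = original_abstract or s2_abstract

    # Assumption: full text outranks the abstract source when labelling the row.
    if fulltext:
        source = "oa_pdf"
    elif not original_abstract and s2_abstract:
        source = "semantic_scholar"
    else:
        source = "original"

    return {
        "abstract_enriched": abstract,
        "tldr": tldr,
        "fulltext": fulltext,
        "content_source": source,
    }
```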

Caching

All API responses and downloaded files are cached to avoid redundant requests:

output/openalex_cache/
├── semantic_scholar/    # S2 API JSON responses
├── unpaywall/           # Unpaywall API JSON responses
├── pdfs/                # Downloaded PDF files
└── fulltext/            # Extracted text JSON files
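A read-through cache over this layout could be sketched as follows. The helper names and the file-naming scheme (DOI lowercased, unsafe characters replaced with underscores) are assumptions for illustration; only the directory names come from the tree above.

```python
import json
from pathlib import Path
from typing import Callable

CACHE_ROOT = Path("output/openalex_cache")


def cache_path(subdir: str, doi: str, suffix: str = ".json", root: Path = CACHE_ROOT) -> Path:
    """Map a DOI to a file under one cache subdirectory (hypothetical naming scheme)."""
    # DOIs contain "/" and other characters that are awkward in file names.
    safe = "".join(c if c.isalnum() or c in "._-" else "_" for c in doi.lower())
    return Path(root) / subdir / (safe + suffix)


def cached_json(subdir: str, doi: str, produce: Callable[[], dict], root: Path = CACHE_ROOT) -> dict:
    """Return the cached JSON for a DOI, calling `produce` only on a cache miss."""
    path = cache_path(subdir, doi, root=root)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    data = produce()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

For example, `cached_json("unpaywall", doi, lambda: query_unpaywall(doi))` would hit the network only the first time a DOI is seen, where `query_unpaywall` stands for whatever function performs the request.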

Running

# Via pipeline
uv run python main.py --phase 1.5

# Standalone
uv run python -m src.content_enricher

# Skip this phase
uv run python main.py --skip-enrich