src/content_enricher

Phase 1.5 module for enriching paper content via external APIs and Open Access (OA) PDFs.

Functions

fetch_semantic_scholar(doi: str, cache_dir: str) → dict

Fetch abstract and TLDR from the Semantic Scholar API.

Parameters:

| Name | Type | Description |
|------|------|-------------|
| doi | str | Paper DOI |
| cache_dir | str | Directory for caching responses |

Returns: Dict with keys abstract and tldr. Either value may be None if not found.

Side effects: Caches response to {cache_dir}/semantic_scholar/{doi_sanitized}.json. Rate-limited to 1 req/s.
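The caching and rate limiting described above could look roughly like the following. This is a minimal sketch, not the module's actual code: the helper names (sanitize_doi, cache_path, parse_response, fetch_semantic_scholar_sketch), the sanitization scheme, and the choice to cache "not found" responses are all assumptions made for illustration. The endpoint is the public Semantic Scholar Graph API, where tldr is returned as an object with a text field.

```python
import json
import re
import time
import urllib.error
import urllib.parse
import urllib.request
from pathlib import Path

S2_URL = "https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}?fields=abstract,tldr"
MIN_INTERVAL = 1.0  # seconds between requests (1 req/s)
_last_request = float("-inf")  # monotonic timestamp of the last API call


def sanitize_doi(doi: str) -> str:
    """Replace characters that are unsafe in file names (assumed scheme)."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", doi)


def cache_path(doi: str, cache_dir: str) -> Path:
    """Build {cache_dir}/semantic_scholar/{doi_sanitized}.json."""
    return Path(cache_dir) / "semantic_scholar" / f"{sanitize_doi(doi)}.json"


def parse_response(payload: dict) -> dict:
    """Reduce an API payload to the two documented keys."""
    tldr = payload.get("tldr") or {}
    return {"abstract": payload.get("abstract"), "tldr": tldr.get("text")}


def fetch_semantic_scholar_sketch(doi: str, cache_dir: str) -> dict:
    global _last_request
    path = cache_path(doi, cache_dir)
    if path.exists():
        # Cache hit: no network call and no rate-limit wait.
        return parse_response(json.loads(path.read_text(encoding="utf-8")))

    # Sleep just long enough to keep at most one request per second.
    wait = MIN_INTERVAL - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)

    url = S2_URL.format(doi=urllib.parse.quote(doi, safe="/"))
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            payload = json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code != 404:
            raise
        payload = {}  # unknown DOI: both values end up as None
    finally:
        _last_request = time.monotonic()

    # Assumption: "not found" results are cached too, to avoid repeat lookups.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload), encoding="utf-8")
    return parse_response(payload)
```

Checking the cache before the rate-limit wait means re-runs over already-fetched DOIs are not slowed down by the 1 req/s limit.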


fetch_oa_fulltext(doi: str, pdf_url: str, cache_dir: str) → dict

Download an Open Access PDF and extract full text.

Parameters:

| Name | Type | Description |
|------|------|-------------|
| doi | str | Paper DOI |
| pdf_url | str | Direct URL to the OA PDF |
| cache_dir | str | Directory for caching |

Returns: Dict with key fulltext (extracted text string).


_extract_text_from_pdf(pdf_path: str) → str

Extract text from a PDF file using PyMuPDF (primary) or pdfplumber (fallback).

Parameters:

| Name | Type | Description |
|------|------|-------------|
| pdf_path | str | Path to the PDF file |

Returns: Extracted text string.
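The primary/fallback arrangement can be sketched as a list of extractors tried in order. The function names below are hypothetical, and the rule "move on when an extractor fails or returns only whitespace" is an assumption rather than documented behavior. The library calls used (fitz.open and page.get_text() for PyMuPDF, pdfplumber.open and page.extract_text() for pdfplumber) are imported lazily, so a missing library simply triggers the fallback.

```python
from typing import Callable, Sequence


def _extract_with_pymupdf(pdf_path: str) -> str:
    import fitz  # PyMuPDF; imported lazily so its absence triggers the fallback

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def _extract_with_pdfplumber(pdf_path: str) -> str:
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_text_sketch(
    pdf_path: str,
    extractors: Sequence[Callable[[str], str]] = (
        _extract_with_pymupdf,
        _extract_with_pdfplumber,
    ),
) -> str:
    """Return text from the first extractor that yields non-empty output."""
    for extractor in extractors:
        try:
            text = extractor(pdf_path)
        except Exception:
            # Missing library, unreadable file, or parser error: try the next one.
            continue
        if text.strip():
            return text
    return ""  # assumption: an empty string signals that nothing could be extracted
```

Passing the extractors as a parameter keeps the fallback logic testable without real PDF files.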


_get_unpaywall_pdf_url(doi: str, cache_dir: str) → str | None

Query the Unpaywall API to find an Open Access PDF URL for a given DOI.

Parameters:

| Name | Type | Description |
|------|------|-------------|
| doi | str | Paper DOI |
| cache_dir | str | Directory for caching responses |

Returns: PDF URL string, or None if no OA version found.
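For illustration, an Unpaywall lookup might be structured as below. This is a sketch under assumptions: the function names are hypothetical, caching is left out for brevity, and the email parameter is an addition (the Unpaywall v2 API expects a contact email in the query string, while the documented signature above does not show where this module gets one). The response fields best_oa_location, oa_locations, and url_for_pdf are part of the Unpaywall response format.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional

UNPAYWALL_URL = "https://api.unpaywall.org/v2/{doi}?email={email}"


def pdf_url_from_unpaywall(payload: dict) -> Optional[str]:
    """Pick a PDF URL from an Unpaywall response, preferring the best location."""
    best = payload.get("best_oa_location") or {}
    if best.get("url_for_pdf"):
        return best["url_for_pdf"]
    # Fall back to any other OA location that exposes a direct PDF link.
    for location in payload.get("oa_locations") or []:
        if location.get("url_for_pdf"):
            return location["url_for_pdf"]
    return None


def get_unpaywall_pdf_url_sketch(doi: str, email: str) -> Optional[str]:
    """Query Unpaywall for a DOI; `email` is a hypothetical extra parameter."""
    url = UNPAYWALL_URL.format(
        doi=urllib.parse.quote(doi, safe="/"),
        email=urllib.parse.quote(email),
    )
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return pdf_url_from_unpaywall(json.load(resp))
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # DOI unknown to Unpaywall
        raise
```

Separating the response parsing from the HTTP call lets the selection logic be checked against saved responses.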


fetch_core_fulltext(doi: str, cache_dir: str) → dict

Fetch full text from the CORE API. Currently a placeholder implementation.


enrich_content(publications: pl.DataFrame, cfg: dict, strategies: list[str] = None) → pl.DataFrame

Main enrichment orchestrator. Applies enrichment strategies sequentially.

Parameters:

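| Name | Type | Default | Description |
|------|------|---------|-------------|
| publications | pl.DataFrame | | Publications DataFrame |
| cfg | dict | | Project configuration dict |
| strategies | list[str] | ["semantic_scholar", "oa_pdf"] | Strategies to apply, in order. The signature default is None, which selects the list shown. |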

Returns: Updated DataFrame with abstract_enriched, tldr, fulltext, content_source columns.
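The sequential application of strategies can be sketched on plain dict rows instead of a pl.DataFrame, to keep the example dependency-free. Everything here is illustrative: enrich_rows, the registry of strategy callables, and the rule that an earlier strategy's value is never overwritten (with content_source recording the first strategy that contributed) are assumptions, not the module's confirmed behavior.

```python
from typing import Callable, Dict, List, Optional

DEFAULT_STRATEGIES = ["semantic_scholar", "oa_pdf"]
OUTPUT_COLUMNS = ("abstract_enriched", "tldr", "fulltext")

# A strategy takes one publication row and returns values for output columns.
Strategy = Callable[[dict], dict]


def enrich_rows(
    rows: List[dict],
    registry: Dict[str, Strategy],
    strategies: Optional[List[str]] = None,
) -> List[dict]:
    """Apply each named strategy in order; earlier results take precedence."""
    if strategies is None:
        strategies = DEFAULT_STRATEGIES

    enriched = []
    for row in rows:
        out = dict(row)
        for column in OUTPUT_COLUMNS:
            out.setdefault(column, None)
        out.setdefault("content_source", None)

        for name in strategies:
            result = registry[name](row)
            for column in OUTPUT_COLUMNS:
                value = result.get(column)
                # Fill only empty columns so later strategies never overwrite.
                if value and out[column] is None:
                    out[column] = value
                    if out["content_source"] is None:
                        out["content_source"] = name
        enriched.append(out)
    return enriched
```

A call with two stub strategies shows the precedence rule:

```python
registry = {
    "semantic_scholar": lambda row: {"abstract_enriched": "A", "tldr": None},
    "oa_pdf": lambda row: {"abstract_enriched": "B", "fulltext": "F"},
}
result = enrich_rows([{"doi": "10.1/a"}], registry)
# abstract_enriched stays "A"; fulltext comes from the second strategy.
```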


Print statistics about enrichment coverage (how many papers were enriched by each strategy).


run() → None

Execute the full Phase 1.5 pipeline:

  1. Load publications.parquet
  2. Run enrichment strategies
  3. Print statistics
  4. Overwrite publications.parquet with enriched data