`src/content_enricher`

Phase 1.5 module that enriches paper content using external APIs and Open Access (OA) PDFs.

Functions

`fetch_semantic_scholar(doi: str, cache_dir: str) → dict`

Fetch abstract and TLDR from the Semantic Scholar API.

Parameters:

| Name | Type | Description |
|------|------|-------------|
| `doi` | `str` | Paper DOI |
| `cache_dir` | `str` | Directory for caching responses |

Returns: Dict with keys `abstract` and `tldr` (may be `None` if not found).

Side effects: Caches the response to `{cache_dir}/semantic_scholar/{doi_sanitized}.json`. Rate-limited to 1 request per second.
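The caching and rate-limiting side effects can be sketched as follows. This is a minimal illustration, not the module's actual code: `sanitize_doi`, `cached_fetch`, and the injected `fetch` callable (a stand-in for the HTTP request) are hypothetical names, and the sanitizing rule is an assumption.

```python
import json
import re
import time
from pathlib import Path
from typing import Callable

_MIN_INTERVAL_S = 1.0  # documented limit: 1 request per second
_last_request = float("-inf")  # monotonic timestamp of the last uncached call


def sanitize_doi(doi: str) -> str:
    """Hypothetical sanitizer: replace characters unsafe in file names with '_'."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", doi)


def cached_fetch(doi: str, cache_dir: str, fetch: Callable[[str], dict]) -> dict:
    """Return the cached response for `doi`, calling `fetch` only on a cache miss."""
    global _last_request
    path = Path(cache_dir) / "semantic_scholar" / f"{sanitize_doi(doi)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))

    # Throttle: sleep until at least one second has passed since the last call.
    wait = _MIN_INTERVAL_S - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    result = fetch(doi)
    _last_request = time.monotonic()

    # Persist the response so later calls for the same DOI skip the network.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Cache hits return before the throttle, so only requests that would reach the API are delayed.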

`fetch_oa_fulltext(doi: str, pdf_url: str, cache_dir: str) → dict`

Download an Open Access PDF and extract full text.

Parameters:

| Name | Type | Description |
|------|------|-------------|
| `doi` | `str` | Paper DOI |
| `pdf_url` | `str` | Direct URL to the OA PDF |
| `cache_dir` | `str` | Directory for caching |

Returns: Dict with key `fulltext` (extracted text string).

`_extract_text_from_pdf(pdf_path: str) → str`

Extract text from a PDF file using PyMuPDF (primary) or pdfplumber (fallback).

Parameters:

| Name | Type | Description |
|------|------|-------------|
| `pdf_path` | `str` | Path to the PDF file |

Returns: Extracted text string.
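One way to arrange the primary/fallback extraction is shown below. It is a sketch under assumptions: the helper names are hypothetical, and falling back when the primary extractor raises or returns only whitespace is an assumed policy, not confirmed behavior of this module.

```python
from typing import Callable, Sequence


def _pymupdf_text(pdf_path: str) -> str:
    import fitz  # PyMuPDF; imported lazily so the fallback works without it

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def _pdfplumber_text(pdf_path: str) -> str:
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_with_fallback(
    pdf_path: str,
    extractors: Sequence[Callable[[str], str]] = (_pymupdf_text, _pdfplumber_text),
) -> str:
    """Try each extractor in order; return the first non-empty result."""
    for extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception:
            continue  # missing library or unreadable PDF: try the next extractor
        if text.strip():
            return text
    return ""  # assumption: an empty string signals that nothing could be extracted
```

Passing the extractors as a sequence keeps the order explicit and lets tests substitute stubs for the PDF libraries.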

`_get_unpaywall_pdf_url(doi: str, cache_dir: str) → str | None`

Query the Unpaywall API to find an Open Access PDF URL for a given DOI.

Parameters:

| Name | Type | Description |
|------|------|-------------|
| `doi` | `str` | Paper DOI |
| `cache_dir` | `str` | Directory for caching responses |

Returns: PDF URL string, or `None` if no OA version is found.
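The lookup can be sketched as two small steps: build the request URL, then pick a PDF link from the JSON response. The function names and the `email` parameter are hypothetical; the field names (`best_oa_location`, `oa_locations`, `url_for_pdf`) follow the Unpaywall v2 response schema. Scanning `oa_locations` when the best location has no PDF link is an assumption about the desired behavior.

```python
from __future__ import annotations

from urllib.parse import quote


def unpaywall_url(doi: str, email: str) -> str:
    """Build the Unpaywall v2 lookup URL; the API asks for a contact email."""
    return f"https://api.unpaywall.org/v2/{quote(doi, safe='/')}?email={quote(email)}"


def pdf_url_from_response(payload: dict) -> str | None:
    """Pick a PDF URL from an Unpaywall response, or None if no OA copy has one."""
    best = payload.get("best_oa_location") or {}
    if best.get("url_for_pdf"):
        return best["url_for_pdf"]
    # Assumed fallback: scan the remaining locations for any direct PDF link.
    for location in payload.get("oa_locations") or []:
        if location.get("url_for_pdf"):
            return location["url_for_pdf"]
    return None
```

Keeping the response parsing separate from the HTTP call makes the `None` case easy to exercise with cached or hand-written payloads.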

`fetch_core_fulltext(doi: str, cache_dir: str) → dict`

Fetch full text from the CORE API. Currently a placeholder implementation.

`enrich_content(publications: pl.DataFrame, cfg: dict, strategies: list[str] = None) → pl.DataFrame`

Main enrichment orchestrator. Applies enrichment strategies sequentially.

Parameters:

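| Name | Type | Default | Description |
|------|------|---------|-------------|
| `publications` | `pl.DataFrame` | required | Publications DataFrame |
| `cfg` | `dict` | required | Project configuration dict |
| `strategies` | `list[str]` | `None`, which selects `["semantic_scholar", "oa_pdf"]` | Strategies to apply, in order |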

Returns: Updated DataFrame with `abstract_enriched`, `tldr`, `fulltext`, and `content_source` columns.
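The sequential application of strategies can be illustrated with plain dicts in place of a `pl.DataFrame`, which keeps the sketch dependency-free. Everything here is an assumption for illustration: `enrich_records`, the `registry` mapping of strategy names to callables, and the rule that earlier strategies take precedence and that `content_source` records the first strategy that contributed a value.

```python
from __future__ import annotations

from typing import Callable


def enrich_records(
    records: list[dict],
    registry: dict[str, Callable[[dict], dict]],
    strategies: list[str] | None = None,
) -> list[dict]:
    """Apply each named strategy in order, filling only fields that are still empty."""
    if strategies is None:
        strategies = ["semantic_scholar", "oa_pdf"]
    enriched = []
    for record in records:
        row = dict(record)  # copy so the input records are left untouched
        for name in strategies:
            found = registry[name](row)
            for key, value in found.items():
                if value is not None and row.get(key) is None:
                    row[key] = value
                    # Remember the first strategy that contributed anything.
                    if row.get("content_source") is None:
                        row["content_source"] = name
        enriched.append(row)
    return enriched
```

Because later strategies only fill gaps, reordering `strategies` changes which source wins when two of them return the same field.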

`print_enrichment_stats(publications: pl.DataFrame) → None`

Print statistics about enrichment coverage (how many papers were enriched by each strategy).
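A coverage summary of this kind could be computed from the `content_source` column alone. The sketch below is illustrative only: `format_enrichment_stats` is a hypothetical helper, the output wording is invented, and it assumes `content_source` is `None` for papers that no strategy enriched.

```python
from __future__ import annotations

from collections import Counter


def format_enrichment_stats(sources: list[str | None]) -> list[str]:
    """Return one line per strategy with the count and share of papers it enriched."""
    total = len(sources)
    counts = Counter(s for s in sources if s is not None)
    lines = [f"total papers: {total}"]
    for name, n in counts.most_common():
        lines.append(f"{name}: {n} ({n / total:.0%})")
    lines.append(f"not enriched: {total - sum(counts.values())}")
    return lines


if __name__ == "__main__":
    # With polars, the input could come from publications["content_source"].to_list().
    for line in format_enrichment_stats(["semantic_scholar", "oa_pdf", None]):
        print(line)
```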

`run() → None`

Execute the full Phase 1.5 pipeline:

1. Load `publications.parquet`
2. Run enrichment strategies
3. Print statistics
4. Overwrite `publications.parquet` with enriched data
