src/content_enricher
Phase 1.5 module for enriching paper content via external APIs and Open Access (OA) PDFs.
Functions
fetch_semantic_scholar(doi: str, cache_dir: str) → dict
Fetch abstract and TLDR from the Semantic Scholar API.
Parameters:
| Name | Type | Description |
|---|---|---|
| doi | str | Paper DOI |
| cache_dir | str | Directory for caching responses |
Returns: Dict with keys abstract, tldr (may be None if not found).
Side effects: Caches response to {cache_dir}/semantic_scholar/{doi_sanitized}.json. Rate-limited to 1 req/s.
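A minimal sketch of how the cache path and the 1 req/s limit described above might fit together. The helper names (`sanitize_doi`, `cache_path`, `RateLimiter`, `fetch_cached`), the sanitization rule, and the injected `fetcher` callable are all assumptions made for illustration; the module's actual internals may differ.

```python
import json
import re
import time
from pathlib import Path
from typing import Callable, Optional


def sanitize_doi(doi: str) -> str:
    """Replace characters that are unsafe in filenames (assumed rule)."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", doi)


def cache_path(cache_dir: str, doi: str) -> Path:
    """Build {cache_dir}/semantic_scholar/{doi_sanitized}.json."""
    return Path(cache_dir) / "semantic_scholar" / f"{sanitize_doi(doi)}.json"


class RateLimiter:
    """Sleep as needed so calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval
        self._last_call: Optional[float] = None

    def wait(self) -> None:
        if self._last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


def fetch_cached(
    doi: str,
    cache_dir: str,
    fetcher: Callable[[str], dict],  # hypothetical: performs the real API call
    limiter: RateLimiter,
) -> dict:
    """Return the cached result if present; otherwise fetch, cache, and return."""
    path = cache_path(cache_dir, doi)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))

    limiter.wait()  # only network calls count against the rate limit
    raw = fetcher(doi)
    result = {"abstract": raw.get("abstract"), "tldr": raw.get("tldr")}

    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Passing the network call in as `fetcher` keeps the caching and throttling logic testable offline; cache hits return before the limiter is consulted, so repeated runs do not pay the one-second delay.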
fetch_oa_fulltext(doi: str, pdf_url: str, cache_dir: str) → dict
Download an Open Access PDF and extract full text.
Parameters:
| Name | Type | Description |
|---|---|---|
| doi | str | Paper DOI |
| pdf_url | str | Direct URL to the OA PDF |
| cache_dir | str | Directory for caching |
Returns: Dict with key fulltext (extracted text string).
_extract_text_from_pdf(pdf_path: str) → str
Extract text from a PDF file using PyMuPDF (primary) or pdfplumber (fallback).
Parameters:
| Name | Type | Description |
|---|---|---|
| pdf_path | str | Path to the PDF file |
Returns: Extracted text string.
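The primary/fallback arrangement can be sketched as below. This is an illustration under assumptions, not the module's code: the function names are hypothetical, and the choice to fall through on empty output as well as on exceptions is a guess at reasonable behaviour.

```python
from typing import Callable, Optional, Sequence


def extract_with_pymupdf(pdf_path: str) -> str:
    """Extract text page by page with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so a missing install triggers the fallback

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_with_pdfplumber(pdf_path: str) -> str:
    """Extract text page by page with pdfplumber."""
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None for pages without a text layer
        return "\n".join((page.extract_text() or "") for page in pdf.pages)


def extract_text_with_fallback(
    pdf_path: str,
    extractors: Sequence[Callable[[str], str]] = (
        extract_with_pymupdf,
        extract_with_pdfplumber,
    ),
) -> str:
    """Try each extractor in order; return the first non-empty result."""
    last_error: Optional[Exception] = None
    any_succeeded = False
    for extractor in extractors:
        try:
            text = extractor(pdf_path)
        except Exception as exc:  # includes ImportError for a missing library
            last_error = exc
            continue
        any_succeeded = True
        if text.strip():
            return text
    if not any_succeeded and last_error is not None:
        raise RuntimeError(f"All extractors failed for {pdf_path}") from last_error
    return ""  # at least one extractor ran but found no text
```

Accepting the extractor list as a parameter makes the ordering explicit and lets the fallback logic be exercised with stand-in functions when neither PDF library is installed.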
_get_unpaywall_pdf_url(doi: str, cache_dir: str) → str | None
Query the Unpaywall API to find an Open Access PDF URL for a given DOI.
Parameters:
| Name | Type | Description |
|---|---|---|
| doi | str | Paper DOI |
| cache_dir | str | Directory for caching responses |
Returns: PDF URL string, or None if no OA version found.
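As a rough sketch of the lookup, the snippet below builds an Unpaywall v2 request URL and picks a PDF link out of the JSON response. The function names and the `email` parameter are assumptions (Unpaywall requires a contact email on each request, but how this module supplies it is not documented here); the `best_oa_location`, `oa_locations`, and `url_for_pdf` fields come from Unpaywall's response format.

```python
from typing import Optional
from urllib.parse import quote

UNPAYWALL_BASE = "https://api.unpaywall.org/v2"


def build_unpaywall_url(doi: str, email: str) -> str:
    """Build the request URL for one DOI (email is a required query parameter)."""
    return f"{UNPAYWALL_BASE}/{quote(doi, safe='/')}?email={quote(email)}"


def pick_pdf_url(record: dict) -> Optional[str]:
    """Return a direct PDF URL from an Unpaywall record, or None if there is none."""
    # Prefer the location Unpaywall ranks as best.
    best = record.get("best_oa_location") or {}
    if best.get("url_for_pdf"):
        return best["url_for_pdf"]

    # Otherwise scan the remaining OA locations for any direct PDF link.
    for location in record.get("oa_locations") or []:
        if location.get("url_for_pdf"):
            return location["url_for_pdf"]
    return None
```

For example, `pick_pdf_url({"best_oa_location": None, "oa_locations": []})` returns `None`, matching the documented behaviour when no OA version is found.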
fetch_core_fulltext(doi: str, cache_dir: str) → dict
Fetch full text from the CORE API. Currently a placeholder implementation.
enrich_content(publications: pl.DataFrame, cfg: dict, strategies: list[str] = None) → pl.DataFrame
Main enrichment orchestrator. Applies enrichment strategies sequentially.
Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| publications | pl.DataFrame | required | Publications DataFrame |
| cfg | dict | required | Project configuration dict |
| strategies | list[str] | ["semantic_scholar", "oa_pdf"] | Strategies to apply in order |
Returns: Updated DataFrame with abstract_enriched, tldr, fulltext, content_source columns.
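The sequential application of strategies might look roughly like the sketch below. To stay dependency-free it works on a list of plain dicts instead of a `pl.DataFrame`, and several details are assumptions: the `registry` mapping, the rule that earlier strategies win when a field is already filled, and the reading of `content_source` as the first strategy that contributed anything.

```python
from typing import Callable, Dict, List, Optional

Strategy = Callable[[dict], dict]  # takes a row, returns fields it could fill

DEFAULT_STRATEGIES = ["semantic_scholar", "oa_pdf"]
OUTPUT_FIELDS = ("abstract_enriched", "tldr", "fulltext", "content_source")


def enrich_records(
    records: List[dict],
    registry: Dict[str, Strategy],  # hypothetical name -> strategy mapping
    strategies: Optional[List[str]] = None,
) -> List[dict]:
    """Apply each named strategy in order, filling only fields still empty."""
    order = DEFAULT_STRATEGIES if strategies is None else strategies
    enriched = []
    for record in records:
        # Start every output field at None, then overlay the input row.
        row = {**{field: None for field in OUTPUT_FIELDS}, **record}
        for name in order:
            updates = registry[name](row)
            gained = False
            for key, value in updates.items():
                if value and not row.get(key):
                    row[key] = value
                    gained = True
            # Assumed semantics: remember the first strategy that added content.
            if gained and row["content_source"] is None:
                row["content_source"] = name
        enriched.append(row)
    return enriched
```

Resolving `strategies=None` to the default list inside the function avoids a mutable default argument, which would be consistent with the `None` in the signature and the list shown in the Default column.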
print_enrichment_stats(publications: pl.DataFrame) → None
Print statistics about enrichment coverage (how many papers were enriched by each strategy).
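One way such coverage statistics could be computed is sketched here, again over plain dict rows instead of a `pl.DataFrame`. The function names, the `"none"` label for unenriched papers, and the output layout are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, List


def enrichment_stats(rows: List[dict]) -> Dict[str, object]:
    """Count papers per content_source; rows without a source count as 'none'."""
    by_source = Counter(row.get("content_source") or "none" for row in rows)
    return {"total": len(rows), "by_source": dict(by_source)}


def format_stats(stats: Dict[str, object]) -> str:
    """Render the counts as an indented, alphabetically sorted report."""
    total = stats["total"]
    lines = [f"Total papers: {total}"]
    for source, count in sorted(stats["by_source"].items()):
        share = 100 * count / total if total else 0.0
        lines.append(f"  {source}: {count} ({share:.1f}%)")
    return "\n".join(lines)
```

Calling `print(format_stats(enrichment_stats(rows)))` would then emit the report; separating the counting from the formatting keeps the numbers easy to check.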
run() → None
Execute the full Phase 1.5 pipeline:
- Load publications.parquet
- Run enrichment strategies
- Print statistics
- Overwrite publications.parquet with enriched data