src/openalex_enricher
Phase 2a module for enriching publications with OpenAlex concepts, topics, and field classifications.
Functions
fetch_work_concepts(work_id: str, email: str, cache_dir: str) → dict
Fetch concept and topic data from the OpenAlex API for a single work.
Parameters:
| Name | Type | Description |
|---|---|---|
| work_id | str | OpenAlex work ID (e.g., W2078536640) |
| email | str | Email for polite pool access |
| cache_dir | str | Directory for caching JSON responses |
Returns: Dict with keys:
- concepts: List of {name, score} concept entries
- topics: List of topic classifications
- field: Primary field name (e.g., "Physics")
Side effects: Caches response to {cache_dir}/{work_id}.json.
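
The behaviour described above could be sketched as follows. This is a minimal illustration rather than the module's actual implementation. It assumes the OpenAlex `/works/{id}` endpoint with a `mailto` query parameter, and response keys `concepts`, `topics`, and `primary_topic`. The `fetch_json` parameter and the `_http_get_json` helper are hypothetical, added so the network call can be swapped out, and reading from the cache before fetching is also an assumption.

```python
import json
import urllib.parse
import urllib.request
from pathlib import Path


def _http_get_json(url: str) -> dict:
    """Hypothetical helper: GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def fetch_work_concepts(work_id: str, email: str, cache_dir: str,
                        fetch_json=_http_get_json) -> dict:
    """Sketch: return concepts, topics, and field for one work."""
    cache_path = Path(cache_dir) / f"{work_id}.json"

    if cache_path.exists():
        # Assumption: a cached response is reused instead of refetched.
        raw = json.loads(cache_path.read_text(encoding="utf-8"))
    else:
        # The mailto parameter identifies the caller for the polite pool.
        query = urllib.parse.urlencode({"mailto": email})
        raw = fetch_json(f"https://api.openalex.org/works/{work_id}?{query}")
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(json.dumps(raw), encoding="utf-8")

    # Assumed response layout: primary_topic.field.display_name.
    primary = raw.get("primary_topic") or {}
    return {
        "concepts": [
            {"name": c.get("display_name"), "score": c.get("score")}
            for c in raw.get("concepts") or []
        ],
        "topics": raw.get("topics") or [],
        "field": (primary.get("field") or {}).get("display_name"),
    }
```

Passing a stub as `fetch_json` makes the function testable offline; in normal use the default helper performs the HTTP request.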
enrich_publications(publications: pl.DataFrame, cfg: dict) → pl.DataFrame
Batch-enrich a publications DataFrame with OpenAlex data.
Parameters:
| Name | Type | Description |
|---|---|---|
| publications | pl.DataFrame | Publications with openalex_work_id column |
| cfg | dict | Project configuration dict |
Returns: Updated DataFrame with added columns:
- openalex_concepts: JSON string of concepts list
- openalex_topics: JSON string of topics list
- openalex_field: Primary field string

run() → None
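
To illustrate how the three added columns relate to the per-work data, here is a dependency-free sketch of the row-level logic. It operates on plain dicts rather than a `pl.DataFrame`. The name `enrich_rows` and the `fetch` callable are hypothetical, and leaving rows without a work ID with empty lists and a `None` field is an assumption, not documented behaviour.

```python
import json


def enrich_rows(rows, fetch):
    """Return copies of `rows` with the three OpenAlex columns added.

    `fetch` is any callable mapping a work ID to a dict with
    `concepts`, `topics`, and `field` keys (hypothetical interface).
    """
    enriched = []
    for row in rows:
        work_id = row.get("openalex_work_id")
        # Assumption: rows without a work ID get empty values.
        data = fetch(work_id) if work_id else {}
        enriched.append({
            **row,
            # Lists are serialized so each cell holds a single JSON string.
            "openalex_concepts": json.dumps(data.get("concepts", [])),
            "openalex_topics": json.dumps(data.get("topics", [])),
            "openalex_field": data.get("field"),
        })
    return enriched
```

In a Polars-based version, the same three value lists could be attached to the frame with `DataFrame.with_columns`.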
Execute the full Phase 2a pipeline:
- Load publications.parquet
- Fetch OpenAlex data for each paper
- Add concept/topic/field columns
- Overwrite publications.parquet