src/openalex_enricher

Phase 2a module for enriching publications with OpenAlex concepts, topics, and field classifications.

Functions

fetch_work_concepts(work_id: str, email: str, cache_dir: str) → dict

Fetch concept and topic data from the OpenAlex API for a single work.

Parameters:

Name       Type  Description
work_id    str   OpenAlex work ID (e.g., W2078536640)
email      str   Email for polite pool access
cache_dir  str   Directory for caching JSON responses

Returns: Dict with keys:

  • concepts — List of {name, score} concept entries
  • topics — List of topic classifications
  • field — Primary field name (e.g., "Physics")

Side effects: Caches response to {cache_dir}/{work_id}.json.
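The cache-then-fetch behavior described above can be sketched roughly as follows. This is an illustration, not the module's actual code: the name fetch_work_concepts_sketch and the injectable fetch_json parameter are hypothetical, and the sketch assumes the raw response follows the public OpenAlex work schema (concepts[].display_name, concepts[].score, topics, primary_topic.field.display_name) and that the raw response is what gets cached.

```python
import json
import urllib.parse
import urllib.request
from pathlib import Path


def _http_get_json(url: str) -> dict:
    """Default fetcher: a plain GET that decodes the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def fetch_work_concepts_sketch(work_id, email, cache_dir, fetch_json=_http_get_json):
    """Hypothetical sketch: return concepts/topics/field for one work, using a file cache."""
    cache_path = Path(cache_dir) / f"{work_id}.json"
    if cache_path.exists():
        # Cache hit: reuse the stored response and skip the network entirely.
        raw = json.loads(cache_path.read_text(encoding="utf-8"))
    else:
        # The mailto parameter is how OpenAlex identifies polite pool callers.
        query = urllib.parse.urlencode({"mailto": email})
        raw = fetch_json(f"https://api.openalex.org/works/{work_id}?{query}")
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(json.dumps(raw), encoding="utf-8")

    # Assumption: the primary field is taken from primary_topic.field.
    primary_topic = raw.get("primary_topic") or {}
    return {
        "concepts": [
            {"name": c.get("display_name"), "score": c.get("score")}
            for c in raw.get("concepts") or []
        ],
        "topics": raw.get("topics") or [],
        "field": (primary_topic.get("field") or {}).get("display_name"),
    }
```

Passing the fetcher as a parameter keeps the cache logic testable offline; a second call with the same work_id reads {cache_dir}/{work_id}.json instead of calling the API again.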


enrich_publications(publications: pl.DataFrame, cfg: dict) → pl.DataFrame

Batch-enrich a publications DataFrame with OpenAlex data.

Parameters:

Name          Type          Description
publications  pl.DataFrame  Publications with openalex_work_id column
cfg           dict          Project configuration dict

Returns: Updated DataFrame with added columns:

  • openalex_concepts — JSON string of concepts list
  • openalex_topics — JSON string of topics list
  • openalex_field — Primary field string
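One way the three added columns could be built is sketched below. The names build_enrichment_columns and enrich_publications_sketch are hypothetical, and the sketch takes a fetch callable (work ID to a dict with concepts, topics, and field) instead of reading cfg, since the configuration keys are not documented here. Emitting nulls for rows without a work ID is also an assumption.

```python
import json


def build_enrichment_columns(work_ids, fetch):
    """Hypothetical helper: map work IDs to the three column value lists."""
    columns = {"openalex_concepts": [], "openalex_topics": [], "openalex_field": []}
    for work_id in work_ids:
        if work_id is None:
            # Assumption: rows without a work ID get nulls in all three columns.
            for values in columns.values():
                values.append(None)
            continue
        data = fetch(work_id)
        # Lists are stored as JSON strings, matching the column descriptions above.
        columns["openalex_concepts"].append(json.dumps(data["concepts"]))
        columns["openalex_topics"].append(json.dumps(data["topics"]))
        columns["openalex_field"].append(data["field"])
    return columns


def enrich_publications_sketch(publications, fetch):
    """Hypothetical wrapper: return the DataFrame with the three columns appended."""
    # Imported here so the pure-Python helper above stays usable without polars.
    import polars as pl

    work_ids = publications["openalex_work_id"].to_list()
    columns = build_enrichment_columns(work_ids, fetch)
    return publications.with_columns(
        [pl.Series(name, values, dtype=pl.Utf8) for name, values in columns.items()]
    )
```

Because the list columns hold JSON strings, downstream code would call json.loads on each value to get the list back.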

run() → None

Execute the full Phase 2a pipeline:

  1. Load publications.parquet
  2. Fetch OpenAlex data for each paper
  3. Add concept/topic/field columns
  4. Overwrite publications.parquet
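The four steps can be sketched as a small driver, assuming polars is installed. The name run_sketch, its two parameters, and the write-to-temporary-file-then-rename step are illustrative choices that the source does not describe; enrich stands in for whatever callable performs steps 2 and 3.

```python
from pathlib import Path


def run_sketch(parquet_path, enrich):
    """Hypothetical sketch of the load, enrich, overwrite sequence."""
    import polars as pl

    path = Path(parquet_path)
    publications = pl.read_parquet(path)  # step 1: load publications.parquet
    enriched = enrich(publications)       # steps 2-3: fetch data, add columns

    # Step 4: write to a sibling temp file first, then swap it into place,
    # so a failed write does not leave a truncated publications.parquet.
    tmp_path = path.with_name(path.name + ".tmp")
    enriched.write_parquet(tmp_path)
    tmp_path.replace(path)
```

Since the file is overwritten in place, rerunning the pipeline would start from the already-enriched table, so the enrich step should tolerate existing openalex_* columns.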