src/data_loader

Phase 1 module for loading and cleaning raw data.

Functions

decode_abstract(inverted_index: dict) → str

Reconstruct readable abstract text from OpenAlex inverted index format.

Parameters:

| Name             | Type | Description                                                               |
|------------------|------|---------------------------------------------------------------------------|
| `inverted_index` | dict | Mapping of words to position lists, e.g., `{"the": [0, 5], "Nobel": [1]}` |

Returns: Reconstructed abstract string. Returns `""` if the input is empty or None.
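The reconstruction can be sketched as follows. This is an illustrative re-implementation based on the description above, not necessarily the module's actual source:

```python
def decode_abstract(inverted_index: dict) -> str:
    """Rebuild abstract text from an OpenAlex-style inverted index."""
    if not inverted_index:
        return ""
    # Pair each word with every position it occupies, then join in order.
    positions = []
    for word, idxs in inverted_index.items():
        for i in idxs:
            positions.append((i, word))
    return " ".join(word for _, word in sorted(positions))

print(decode_abstract({"the": [0, 5], "Nobel": [1]}))  # → "the Nobel the"
```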


load_laureates(data_dir: str) → pl.DataFrame

Load laureate.csv from the given directory.

Parameters:

| Name       | Type | Description                     |
|------------|------|---------------------------------|
| `data_dir` | str  | Path to the `db_data/` directory |

Returns: Polars DataFrame with laureate records.


load_awards(data_dir: str) → pl.DataFrame

Load award_info.csv from the given directory.


load_categories(data_dir: str) → pl.DataFrame

Load nobel_prize_category.csv from the given directory.


load_laureate_matching(data_dir: str) → pl.DataFrame

Load laureate_openalex_matching.csv for laureate ↔ OpenAlex author ID mapping.


load_institutions(data_dir: str) → pl.DataFrame

Load institution.csv from the given directory.


load_publications_sample(json_path: str, laureate_ids: list, max_per_laureate: int = 100) → pl.DataFrame

Stream-read the JSONL publication records file, filtering by laureate IDs.

Parameters:

| Name               | Type      | Default  | Description                       |
|--------------------|-----------|----------|-----------------------------------|
| `json_path`        | str       | required | Path to `publication_records.json` |
| `laureate_ids`     | list[int] | required | Laureate IDs to filter for        |
| `max_per_laureate` | int       | 100      | Maximum papers per laureate       |

Returns: Polars DataFrame with filtered publication records.


run(sample_mode: bool = True) → None

Execute the full Phase 1 pipeline:

  1. Load all CSV data
  2. Join awards with categories
  3. Load publications (sample or full)
  4. Decode abstracts
  5. Compute total citation counts
  6. Save 4 Parquet files to output/clean_data/

Parameters:

| Name          | Type | Default | Description                                  |
|---------------|------|---------|----------------------------------------------|
| `sample_mode` | bool | True    | If True, process only a sample of laureates |