src/data_loader

Phase 1 module for loading and cleaning raw data.

Functions

decode_abstract(inverted_index: dict) → str

Reconstruct readable abstract text from OpenAlex inverted index format.

Parameters:

| Name             | Type | Description                                                               |
|------------------|------|---------------------------------------------------------------------------|
| `inverted_index` | dict | Mapping of words to position lists, e.g., `{"the": [0, 5], "Nobel": [1]}` |

Returns: Reconstructed abstract string. Returns `""` if the input is empty or None.
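The reconstruction can be sketched as follows. This is an illustrative re-implementation based on the description above, not necessarily the module's actual source:

```python
def decode_abstract(inverted_index: dict) -> str:
    """Rebuild abstract text from an OpenAlex-style inverted index."""
    if not inverted_index:
        return ""
    # Pair each word with every position it occupies, then join in order.
    positions = []
    for word, idxs in inverted_index.items():
        for i in idxs:
            positions.append((i, word))
    return " ".join(word for _, word in sorted(positions))

print(decode_abstract({"the": [0, 5], "Nobel": [1]}))  # → "the Nobel the"
```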


load_laureates(data_dir: str) → pl.DataFrame

Load laureate.csv from the given directory.

Parameters:

| Name       | Type | Description                     |
|------------|------|---------------------------------|
| `data_dir` | str  | Path to the `db_data/` directory |

Returns: Polars DataFrame with laureate records.


load_awards(data_dir: str) → pl.DataFrame

Load award_info.csv from the given directory.


load_categories(data_dir: str) → pl.DataFrame

Load nobel_prize_category.csv from the given directory.


load_laureate_matching(data_dir: str) → pl.DataFrame

Load laureate_openalex_matching.csv for laureate ↔ OpenAlex author ID mapping.


load_institutions(data_dir: str) → pl.DataFrame

Load institution.csv from the given directory.


load_publications_sample(json_path: str, laureate_ids: list, max_per_laureate: int = 100) → pl.DataFrame

Stream-read the JSONL publication records file, filtering by laureate IDs.

Parameters:

| Name               | Type      | Default  | Description                       |
|--------------------|-----------|----------|-----------------------------------|
| `json_path`        | str       | required | Path to `publication_records.json` |
| `laureate_ids`     | list[int] | required | Laureate IDs to filter for        |
| `max_per_laureate` | int       | 100      | Maximum papers per laureate       |

Returns: Polars DataFrame with filtered publication records.


run(sample_mode: bool = True) → None

Execute the full Phase 1 pipeline:

  1. Load all CSV data
  2. Join awards with categories
  3. Load publications (sample or full)
  4. Decode abstracts
  5. Compute total citation counts
  6. Save 4 Parquet files to output/clean_data/

Parameters:

| Name          | Type | Default | Description                                  |
|---------------|------|---------|----------------------------------------------|
| `sample_mode` | bool | True    | If True, process only a sample of laureates |