src/data_loader¶
Phase 1 module for loading and cleaning raw data.
Functions¶
decode_abstract(inverted_index: dict) → str¶
Reconstruct readable abstract text from OpenAlex inverted index format.
Parameters:
| Name | Type | Description |
|---|---|---|
| inverted_index | dict | Mapping of words to position lists, e.g., {"the": [0, 5], "Nobel": [1]} |
Returns: Reconstructed abstract string. Returns "" if input is empty/None.
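A minimal sketch of the reconstruction, assuming the standard OpenAlex inverted-index shape (word → list of word positions):

```python
def decode_abstract(inverted_index: dict) -> str:
    """Rebuild abstract text from an OpenAlex-style inverted index."""
    # Empty/None input yields "", per the documented contract.
    if not inverted_index:
        return ""
    # Flatten to (position, word) pairs, sort by position, join with spaces.
    pairs = [(pos, word)
             for word, positions in inverted_index.items()
             for pos in positions]
    return " ".join(word for _, word in sorted(pairs))
```

For example, {"the": [0, 2], "Nobel": [1]} decodes to "the Nobel the".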
load_laureates(data_dir: str) → pl.DataFrame¶
Load laureate.csv from the given directory.
Parameters:
| Name | Type | Description |
|---|---|---|
| data_dir | str | Path to the db_data/ directory |
Returns: Polars DataFrame with laureate records.
load_awards(data_dir: str) → pl.DataFrame¶
Load award_info.csv from the given directory.
load_categories(data_dir: str) → pl.DataFrame¶
Load nobel_prize_category.csv from the given directory.
load_laureate_matching(data_dir: str) → pl.DataFrame¶
Load laureate_openalex_matching.csv for laureate ↔ OpenAlex author ID mapping.
load_institutions(data_dir: str) → pl.DataFrame¶
Load institution.csv from the given directory.
load_publications_sample(json_path: str, laureate_ids: list[int], max_per_laureate: int = 100) → pl.DataFrame¶
Stream-read the JSONL publication records file, filtering by laureate IDs.
Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| json_path | str | — | Path to publication_records.json |
| laureate_ids | list[int] | — | Laureate IDs to filter for |
| max_per_laureate | int | 100 | Maximum papers per laureate |
Returns: Polars DataFrame with filtered publication records.
run(sample_mode: bool = True) → None¶
Execute the full Phase 1 pipeline:
- Load all CSV data
- Join awards with categories
- Load publications (sample or full)
- Decode abstracts
- Compute total citation counts
- Save 4 Parquet files to output/clean_data/
Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| sample_mode | bool | True | If True, only process sample laureates |