# Phase 1: Data Loading & Cleaning

**Module:** `src/data_loader.py`
**Estimated Time:** ~2 days (full dataset)
## Objective
Load all raw CSV and JSON data into Python, decode OpenAlex inverted-index abstracts, clean the data, and output structured Parquet files for downstream processing.
## Input Data

### CSV Tables (`data/26963326/db_data/`)

| File | Records | Key Fields |
|---|---|---|
| `laureate.csv` | 757 | Name, gender, nationality, education, Wikipedia/Wikidata |
| `award_info.csv` | 761 | Year, category, motivation, prize amount |
| `nobel_prize_category.csv` | 4 | Economics, Physics, Chemistry, Physiology/Medicine |
| `laureate_openalex_matching.csv` | ~840 | Laureate ↔ OpenAlex author ID mapping |
| `institution.csv` | ~13K | Name, ROR, geographic location |
| `work.csv` | ~245K | Title, keywords, abstract (inverted index), citations, DOI, year |
| `work_authorship.csv` | ~1.67M | Paper-author-institution relationships |
| `work_citation_by_year.csv` | ~9M | Year-by-year citation counts |
### JSON Data (`data/26963326/json/`)

| File | Size | Content |
|---|---|---|
| `publication_records.json` | 2.3 GB | 253K papers as JSONL with full metadata |
| `laureate_info.json` | 924 KB | Laureate details with geo-coordinates |
| `award_details.json` | 664 KB | Award details with geo-coordinates |
## Processing Steps

### Step 1: Load CSV Data

All CSV files are loaded with Polars for high-performance columnar processing:

```python
import polars as pl

laureates = pl.read_csv(data_dir / "laureate.csv")
awards = pl.read_csv(data_dir / "award_info.csv")
categories = pl.read_csv(data_dir / "nobel_prize_category.csv")
matching = pl.read_csv(data_dir / "laureate_openalex_matching.csv")
institutions = pl.read_csv(data_dir / "institution.csv")
```
### Step 2: Decode Abstracts

OpenAlex stores abstracts as inverted indices: a mapping from each word to its position(s) in the text. The `decode_abstract()` function reconstructs readable text:

```python
# Input:  {"the": [0, 5], "Nobel": [1], "Prize": [2], ...}
# Output: "the Nobel Prize ... the ..."
def decode_abstract(inverted_index: dict) -> str:
    if not inverted_index:
        return ""
    # Invert the mapping: position -> word
    words = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            words[pos] = word
    # Reassemble the text in positional order
    return " ".join(words[i] for i in sorted(words.keys()))
```
### Step 3: Stream Load Publications

The 2.3 GB JSONL file is loaded via streaming with per-laureate filtering:

```python
def load_publications_sample(json_path, laureate_ids, max_per_laureate=100):
    """Stream-read JSONL, filter by laureate, limit per person."""
```
### Step 4: Data Joining

Awards are joined with categories to add field names, and laureates with the matching table to add OpenAlex IDs:

```text
awards + categories → awards with field/category names
laureates + matching → laureates with OpenAlex IDs
```
### Step 5: Output Parquet

Clean data is saved as Parquet files for efficient downstream processing:

- `output/clean_data/laureates.parquet`
- `output/clean_data/awards.parquet`
- `output/clean_data/publications.parquet`
- `output/clean_data/institutions.parquet`
## Sample Mode

In sample mode (the default), only 5 representative laureates are processed. This dramatically reduces processing time and output size while still covering all 4 Nobel Prize fields.