Phase 1: Data Loading & Cleaning

Module: src/data_loader.py
Estimated Time: ~2 days (full dataset)

Objective

Load all raw CSV and JSON data into Python, decode OpenAlex inverted-index abstracts, clean the data, and output structured Parquet files for downstream processing.

Input Data

CSV Tables (data/26963326/db_data/)

File                            Records  Key Fields
laureate.csv                    757      Name, gender, nationality, education, Wikipedia/Wikidata
award_info.csv                  761      Year, category, motivation, prize amount
nobel_prize_category.csv        4        Economics, Physics, Chemistry, Physiology/Medicine
laureate_openalex_matching.csv  ~840     Laureate ↔ OpenAlex author ID mapping
institution.csv                 ~13K     Name, ROR, geographic location
work.csv                        ~245K    Title, keywords, abstract (inverted index), citations, DOI, year
work_authorship.csv             ~1.67M   Paper-author-institution relationships
work_citation_by_year.csv       ~9M      Year-by-year citation counts

JSON Data (data/26963326/json/)

File                      Size    Content
publication_records.json  2.3 GB  253K papers as JSONL with full metadata
laureate_info.json        924 KB  Laureate details with geo-coordinates
award_details.json        664 KB  Award details with geo-coordinates

Processing Steps

Step 1: Load CSV Data

All CSV files are loaded with Polars for high-performance columnar processing:

from pathlib import Path

import polars as pl

data_dir = Path("data/26963326/db_data")

laureates = pl.read_csv(data_dir / "laureate.csv")
awards = pl.read_csv(data_dir / "award_info.csv")
categories = pl.read_csv(data_dir / "nobel_prize_category.csv")
matching = pl.read_csv(data_dir / "laureate_openalex_matching.csv")
institutions = pl.read_csv(data_dir / "institution.csv")

Step 2: Decode Abstracts

OpenAlex stores abstracts as inverted indices — a mapping from each word to its position(s) in the text. The decode_abstract() function reconstructs readable text:

# Input: {"the": [0, 5], "Nobel": [1], "Prize": [2], ...}
# Output: "the Nobel Prize ... the ..."

def decode_abstract(inverted_index: dict) -> str:
    if not inverted_index:
        return ""
    words = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            words[pos] = word
    return " ".join(words[i] for i in sorted(words.keys()))

Step 3: Stream Load Publications

The 2.3 GB JSONL file is loaded via streaming with per-laureate filtering:

def load_publications_sample(json_path, laureate_ids, max_per_laureate=100):
    """Stream-read JSONL, filter by laureate, limit per person."""

Step 4: Data Joining

Awards are joined with categories to add field names:

awards + categories → awards with field/category names
laureates + matching → laureates with OpenAlex IDs

Step 5: Output Parquet

Clean data is saved as Parquet files for efficient downstream processing:

  • output/clean_data/laureates.parquet
  • output/clean_data/awards.parquet
  • output/clean_data/publications.parquet
  • output/clean_data/institutions.parquet

Sample Mode

In sample mode (default), only 5 representative laureates are processed:

SAMPLE_LAUREATE_IDS = [745, 102, 779, 114, 843]

This dramatically reduces processing time and output size while covering all 4 Nobel Prize fields.

Running

# Via pipeline
uv run python main.py --phase 1

# Standalone
uv run python -m src.data_loader