# Phase 1: Data Loading & Cleaning

**Module:** `src/data_loader.py`
**Estimated Time:** ~2 days (full dataset)
## Objective
Load all raw CSV and JSON data into Python, decode OpenAlex inverted-index abstracts, clean the data, and output structured Parquet files for downstream processing.
## Input Data

### CSV Tables (`data/26963326/db_data/`)

| File | Records | Key Fields |
|---|---|---|
| `laureate.csv` | 757 | Name, gender, nationality, education, Wikipedia/Wikidata |
| `award_info.csv` | 761 | Year, category, motivation, prize amount |
| `nobel_prize_category.csv` | 4 | Economics, Physics, Chemistry, Physiology/Medicine |
| `laureate_openalex_matching.csv` | ~840 | Laureate ↔ OpenAlex author ID mapping |
| `institution.csv` | ~13K | Name, ROR, geographic location |
| `work.csv` | ~245K | Title, keywords, abstract (inverted index), citations, DOI, year |
| `work_authorship.csv` | ~1.67M | Paper-author-institution relationships |
| `work_citation_by_year.csv` | ~9M | Year-by-year citation counts |
### JSON Data (`data/26963326/json/`)

| File | Size | Content |
|---|---|---|
| `publication_records.json` | 2.3 GB | 253K papers as JSONL with full metadata |
| `laureate_info.json` | 924 KB | Laureate details with geo-coordinates |
| `award_details.json` | 664 KB | Award details with geo-coordinates |
## Processing Steps

### Step 1: Load CSV Data

All CSV files are loaded with Polars for high-performance columnar processing:

```python
import polars as pl

laureates = pl.read_csv(data_dir / "laureate.csv")
awards = pl.read_csv(data_dir / "award_info.csv")
categories = pl.read_csv(data_dir / "nobel_prize_category.csv")
matching = pl.read_csv(data_dir / "laureate_openalex_matching.csv")
institutions = pl.read_csv(data_dir / "institution.csv")
```
### Step 2: Decode Abstracts

OpenAlex stores abstracts as inverted indices: a mapping from each word to its position(s) in the text. The `decode_abstract()` function reconstructs readable text:

```python
# Input:  {"the": [0, 5], "Nobel": [1], "Prize": [2], ...}
# Output: "the Nobel Prize ... the ..."
def decode_abstract(inverted_index: dict) -> str:
    if not inverted_index:
        return ""
    # Invert the mapping: position -> word
    words = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            words[pos] = word
    # Reassemble the text in positional order
    return " ".join(words[i] for i in sorted(words.keys()))
```
### Step 3: Stream Load Publications

The 2.3 GB JSONL file is loaded via streaming with per-laureate filtering:

```python
def load_publications_sample(json_path, laureate_ids, max_per_laureate=100):
    """Stream-read JSONL, filter by laureate, limit per person."""
```
### Step 4: Data Joining

Awards are joined with categories to add field names, and laureates with the matching table to add OpenAlex IDs:

```text
awards + categories → awards with field/category names
laureates + matching → laureates with OpenAlex IDs
```
### Step 5: Output Parquet

Clean data is saved as Parquet files for efficient downstream processing:

- `output/clean_data/laureates.parquet`
- `output/clean_data/awards.parquet`
- `output/clean_data/publications.parquet`
- `output/clean_data/institutions.parquet`
## Sample Mode

In sample mode (the default), only 5 representative laureates are processed. This dramatically reduces processing time and output size while still covering all 4 Nobel Prize fields.