Data Sources

Overview

The project combines Nobel Prize records with academic publication data, drawn from a local primary dataset and three external APIs.

Primary Dataset (data/26963326/)

The core dataset contains structured data about Nobel Prize laureates and their scientific publications, sourced from OpenAlex and Nobel Prize archives.

CSV Tables (data/26963326/db_data/)

laureate.csv — Nobel Laureates

| Field | Type | Description |
|---|---|---|
| laureate_id | int | Unique laureate identifier |
| name | string | Full name |
| gender | string | Gender |
| nationality | string | Country of nationality |
| education | string | Educational background |
| wikipedia_url | string | Wikipedia page URL |
| wikidata_id | string | Wikidata entity ID |

Records: 757 laureates

award_info.csv — Award Details

| Field | Type | Description |
|---|---|---|
| laureate_id | int | Laureate identifier |
| year | int | Award year |
| category_id | int | Category identifier |
| motivation | string | Award motivation text |
| prize_amount | float | Prize amount |

Records: 761 awards

nobel_prize_category.csv — Prize Categories

| Field | Type | Description |
|---|---|---|
| category_id | int | Category identifier |
| category_name | string | Category name |

Categories: Physics, Chemistry, Physiology/Medicine, Economics
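
The three tables above share the `laureate_id` and `category_id` keys, so they can be joined into one row per award. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of the project, and it assumes the CSV headers match the field names listed above.

```python
import csv


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def awards_with_names(laureates, awards, categories):
    """Join award rows to laureate names and category names.

    csv.DictReader yields every value as a string, so the ID columns
    are compared as strings and only `year` is converted to int.
    """
    name_by_id = {row["laureate_id"]: row["name"] for row in laureates}
    category_by_id = {row["category_id"]: row["category_name"] for row in categories}
    joined = []
    for award in awards:
        joined.append({
            "name": name_by_id.get(award["laureate_id"]),
            "year": int(award["year"]),
            "category": category_by_id.get(award["category_id"]),
        })
    return joined
```

With the real files this would be called as `awards_with_names(load_rows(".../laureate.csv"), load_rows(".../award_info.csv"), load_rows(".../nobel_prize_category.csv"))`. The join keeps one output row per award rather than per laureate, since the award table has more records than the laureate table.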

work.csv — Publications

| Field | Type | Description |
|---|---|---|
| openalex_work_id | string | OpenAlex work identifier |
| title | string | Paper title |
| keywords | string | Semicolon-separated keywords |
| abstract_inverted_index | JSON | OpenAlex inverted index format |
| referenced_works | string | Cited work IDs |
| doi | string | Digital Object Identifier |
| publication_year | int | Publication year |

Records: ~245,000 papers
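
OpenAlex stores abstracts as an inverted index: a mapping from each word to the list of positions where it occurs. Rebuilding the text means sorting all (position, word) pairs. The sketch below shows one way to do that, together with a splitter for the `keywords` column. It assumes the CSV cell holds the index as JSON text; the function names are illustrative.

```python
import json


def decode_abstract(inverted_index):
    """Rebuild abstract text from an OpenAlex-style inverted index.

    `inverted_index` maps each word to the positions where it appears,
    e.g. {"The": [0], "pathway": [2], "ubiquitin": [1]}.
    """
    if not inverted_index:
        return ""
    placed = []
    for word, positions in inverted_index.items():
        for position in positions:
            placed.append((position, word))
    placed.sort()
    return " ".join(word for _, word in placed)


def decode_abstract_cell(cell):
    """Decode a CSV cell, assuming it contains the index as a JSON string."""
    if not cell or not cell.strip():
        return ""
    return decode_abstract(json.loads(cell))


def parse_keywords(raw):
    """Split a semicolon-separated keyword string, dropping empty entries."""
    return [part.strip() for part in (raw or "").split(";") if part.strip()]
```

Empty cells decode to an empty string, which matters here because not every paper has an abstract or keywords.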

work_authorship.csv — Authorship

| Field | Type | Description |
|---|---|---|
| openalex_work_id | string | Work identifier |
| author_id | string | Author identifier |
| institution_id | string | Institution identifier |
| position | string | Author position (first, middle, last) |

Records: ~1.67 million relationships
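
As an illustration of how the authorship table can be queried, the sketch below counts how often a given author appears in each position. Because each row also carries an `institution_id`, it assumes (without confirmation from the data) that an author with several affiliations on one paper may appear on several rows, and therefore counts each work once. The function name is hypothetical.

```python
from collections import Counter


def position_counts(rows, author_id):
    """Count first/middle/last positions for one author, one count per work.

    `rows` is any iterable of dicts with the work_authorship.csv columns,
    for example a csv.DictReader over the file.
    """
    seen_works = set()
    counts = Counter()
    for row in rows:
        if row["author_id"] != author_id:
            continue
        work_id = row["openalex_work_id"]
        if work_id in seen_works:
            # Same author and work again (assumed: a second affiliation).
            continue
        seen_works.add(work_id)
        counts[row["position"]] += 1
    return dict(counts)
```

Passing a `csv.DictReader` directly keeps memory use flat, since the rows are consumed one at a time rather than loaded as a list.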

work_citation_by_year.csv — Citation Counts

| Field | Type | Description |
|---|---|---|
| openalex_work_id | string | Work identifier |
| year | int | Citation year |
| cited_by_count | int | Number of citations in that year |

Records: ~9 million entries
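
With roughly nine million rows, this table is easier to aggregate in a single streaming pass than to load whole. A minimal sketch, assuming the column names above; the function name and the optional year window are illustrative:

```python
from collections import defaultdict


def total_citations(rows, start_year=None, end_year=None):
    """Sum per-year citation counts for each work.

    `rows` is any iterable of dicts with the columns openalex_work_id,
    year and cited_by_count (values may be strings, as csv.DictReader
    produces). The year bounds are inclusive and optional.
    """
    totals = defaultdict(int)
    for row in rows:
        year = int(row["year"])
        if start_year is not None and year < start_year:
            continue
        if end_year is not None and year > end_year:
            continue
        totals[row["openalex_work_id"]] += int(row["cited_by_count"])
    return dict(totals)
```

Only one running total per work is held in memory, so the cost scales with the number of distinct works rather than the number of rows.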

Other Tables

  • author.csv — ~297K OpenAlex author IDs
  • institution.csv — ~13K institutions (name, ROR, geo-location)
  • source.csv — ~11.5K journals/publishers
  • laureate_openalex_matching.csv — ~840 laureate ↔ OpenAlex ID matches
  • laureate_location_info.csv — ~2K laureate geo-information

JSON Data (data/26963326/json/)

publication_records.json (2.3 GB)

Line-delimited JSON (JSONL) containing full metadata for 253K papers, one record per line. An example record, pretty-printed here for readability:

{
  "id": "W2078536640",
  "title": "The ubiquitin-proteasome proteolytic pathway",
  "publication_year": 1998,
  "cited_by_count": 1250,
  "authorships": [...],
  "referenced_works": ["W...", "W..."],
  "abstract_inverted_index": {...},
  "keywords": [...]
}
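
Since the file is about 2.3 GB, reading it line by line avoids holding every record in memory. The sketch below parses one JSON object per line and filters on `cited_by_count`; the function names are illustrative, and only fields shown in the example record are used.

```python
import json


def iter_records(lines):
    """Yield one parsed record per non-empty line of a JSONL source.

    `lines` can be an open text file or any iterable of strings.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield json.loads(line)


def titles_cited_at_least(lines, min_citations):
    """Return titles of records whose cited_by_count meets the threshold."""
    return [
        record.get("title")
        for record in iter_records(lines)
        if (record.get("cited_by_count") or 0) >= min_citations
    ]
```

In practice the file handle is passed straight in, for example `with open(path, encoding="utf-8") as f: titles_cited_at_least(f, 1000)`.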

laureate_info.json (924 KB)

Laureate details with geographic coordinates for birth, award, and death locations.

award_details.json (664 KB)

Award details including geographic coordinates.

External API Sources

OpenAlex API

  • Purpose: Supplement paper concepts, topics, and field classifications
  • Endpoint: https://api.openalex.org
  • Cache: output/openalex_cache/
  • Documentation: OpenAlex API Docs
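
A read-through cache of the kind described above could look like the following sketch. The `/works/{id}` path is the OpenAlex single-work endpoint; the one-file-per-work cache layout and all function names are assumptions for illustration, not the project's actual implementation.

```python
import json
import urllib.request
from pathlib import Path

BASE_URL = "https://api.openalex.org"


def work_url(work_id):
    """Build the URL of a single OpenAlex work, e.g. W2078536640."""
    return f"{BASE_URL}/works/{work_id}"


def http_get_json(url):
    """Fetch a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def fetch_work(work_id, cache_dir, fetch=http_get_json):
    """Return a work record, reading from the cache when possible.

    Assumed layout: <cache_dir>/<work_id>.json. `fetch` is injectable
    so the cache logic can be exercised without network access.
    """
    cache_file = Path(cache_dir) / f"{work_id}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    record = fetch(work_url(work_id))
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(record), encoding="utf-8")
    return record
```

A call such as `fetch_work("W2078536640", "output/openalex_cache")` would hit the network only on the first request for that ID.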

Semantic Scholar API

  • Purpose: Fetch abstracts and TLDR summaries
  • Cache: output/openalex_cache/semantic_scholar/
  • Documentation: S2 API Docs
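
For reference, a request for an abstract and TLDR by DOI might be built as below. This sketch assumes the Semantic Scholar Graph API paper lookup with `abstract` and `tldr` fields; the helper names and the preference for the TLDR text over the abstract are illustrative choices, not taken from the project.

```python
from urllib.parse import quote

S2_BASE = "https://api.semanticscholar.org/graph/v1"


def s2_paper_url(doi, fields=("abstract", "tldr")):
    """Build a Graph API paper-lookup URL for a bare DOI (no https://doi.org/ prefix)."""
    return f"{S2_BASE}/paper/DOI:{quote(doi, safe='/')}?fields={','.join(fields)}"


def best_summary(record):
    """Prefer the TLDR text; fall back to the abstract; None if neither exists."""
    tldr = record.get("tldr") or {}
    return tldr.get("text") or record.get("abstract")
```

The `or {}` guard covers responses where `tldr` is present but null, which would otherwise raise on `.get`.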

Unpaywall API

  • Purpose: Find Open Access PDF URLs
  • Cache: output/openalex_cache/unpaywall/
  • Documentation: Unpaywall API Docs
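
A lookup by DOI can be sketched as follows, assuming the Unpaywall v2 endpoint (which requires an `email` query parameter) and its `best_oa_location.url_for_pdf` response field. The DOI normalization step is an assumption, in case the `doi` column stores full `https://doi.org/...` URLs; all helper names are illustrative.

```python
from urllib.parse import quote, urlencode

UNPAYWALL_BASE = "https://api.unpaywall.org/v2"


def normalize_doi(doi):
    """Strip a leading resolver prefix so only the bare DOI remains."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            return doi[len(prefix):]
    return doi


def unpaywall_url(doi, email):
    """Build the lookup URL for one DOI."""
    path = quote(normalize_doi(doi), safe="/")
    return f"{UNPAYWALL_BASE}/{path}?{urlencode({'email': email})}"


def pdf_url(record):
    """Return the best Open Access PDF URL from a response, or None."""
    best = record.get("best_oa_location") or {}
    return best.get("url_for_pdf")
```

Papers without an Open Access copy yield `None`, so callers can skip them without special-casing missing keys.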

Data Coverage

| Signal | Coverage | Notes |
|---|---|---|
| Paper titles | ~100% | Available for nearly all papers |
| Keywords | ~70% | Semicolon-separated keyword strings |
| Abstracts (inverted index) | ~57% | OpenAlex format, can be decoded |
| Citation network | Most | Referenced works available |
| Year-by-year citations | ~9M records | Comprehensive citation tracking |
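
Coverage figures like these can be recomputed by counting non-empty cells per column. The sketch below is one way to do it over `work.csv` rows; it treats whitespace-only cells as missing, which is an assumption about how gaps are encoded, and the function name is illustrative.

```python
from collections import Counter


def field_coverage(rows, fields):
    """Return the fraction of rows with a non-empty value for each field.

    `rows` is any iterable of dicts (for example a csv.DictReader).
    """
    total = 0
    filled = Counter()
    for row in rows:
        total += 1
        for field in fields:
            if (row.get(field) or "").strip():
                filled[field] += 1
    return {field: (filled[field] / total if total else 0.0) for field in fields}
```

For example, `field_coverage(reader, ["title", "keywords", "abstract_inverted_index"])` would report one ratio per column in a single pass.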

Known Gaps

| Missing Data | Impact | Mitigation |
|---|---|---|
| Paper full text | Cannot extract deep concept details | Use abstracts + LLM augmentation |
| Subject classification | Cannot directly categorize papers | OpenAlex API + LLM inference |
| Concept ontology | No pre-built hierarchy | Build via LLM + seed concepts |
| Cross-field annotations | No direct labels | Citation network analysis + LLM |