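Data Sources

Overview

The project combines Nobel Prize records with academic publication data. A local primary dataset supplies the core tables, and three external APIs (OpenAlex, Semantic Scholar, Unpaywall) supplement it.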

Primary Dataset (`data/26963326/`)

The core dataset contains structured data about Nobel Prize laureates and their scientific publications, sourced from OpenAlex and Nobel Prize archives.

CSV Tables (`data/26963326/db_data/`)

`laureate.csv` — Nobel Laureates

Field	Type	Description
`laureate_id`	int	Unique laureate identifier
`name`	string	Full name
`gender`	string	Gender
`nationality`	string	Country of nationality
`education`	string	Educational background
`wikipedia_url`	string	Wikipedia page URL
`wikidata_id`	string	Wikidata entity ID

Records: 757 laureates

`award_info.csv` — Award Details

Field	Type	Description
`laureate_id`	int	Laureate identifier
`year`	int	Award year
`category_id`	int	Category identifier
`motivation`	string	Award motivation text
`prize_amount`	float	Prize amount

Records: 761 awards
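
The award count (761) is slightly higher than the laureate count (757), which suggests that `laureate_id` links the two tables one-to-many. The sketch below shows one way to group awards under each laureate using only the standard library. It assumes both files are UTF-8 CSVs with a header row matching the field names above; the sample rows are made up so the snippet runs without the dataset.

```python
import csv
import io
from collections import defaultdict


def read_rows(handle):
    """Parse CSV text from an open file handle into a list of dicts."""
    return list(csv.DictReader(handle))


def awards_by_laureate(laureates, awards):
    """Attach each laureate's award rows via the shared laureate_id column."""
    grouped = defaultdict(list)
    for award in awards:
        grouped[award["laureate_id"]].append(award)
    return {
        row["laureate_id"]: {"name": row["name"], "awards": grouped[row["laureate_id"]]}
        for row in laureates
    }


# Made-up sample standing in for laureate.csv and award_info.csv.
laureate_csv = io.StringIO("laureate_id,name\n1,Example A\n2,Example B\n")
award_csv = io.StringIO("laureate_id,year,category_id\n1,1950,1\n1,1960,2\n2,1970,1\n")

joined = awards_by_laureate(read_rows(laureate_csv), read_rows(award_csv))
multiple = [entry["name"] for entry in joined.values() if len(entry["awards"]) > 1]
print(multiple)
```

With the real files, the `io.StringIO` objects would be replaced by handles opened with `open(path, newline="", encoding="utf-8")`. Note that `csv.DictReader` returns every value as a string, so `year` and `category_id` need an explicit `int()` before numeric use.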

`nobel_prize_category.csv` — Prize Categories

Field	Type	Description
`category_id`	int	Category identifier
`category_name`	string	Category name

Categories: Physics, Chemistry, Physiology/Medicine, Economics

`work.csv` — Publications

Field	Type	Description
`openalex_work_id`	string	OpenAlex work identifier
`title`	string	Paper title
`keywords`	string	Semicolon-separated keywords
`abstract_inverted_index`	JSON	OpenAlex inverted index format
`referenced_works`	string	Cited work IDs
`doi`	string	Digital Object Identifier
`publication_year`	int	Publication year

Records: ~245,000 papers
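
Two of these columns need decoding before use. The OpenAlex inverted index maps each word to the list of positions where it occurs, so the abstract can be rebuilt by sorting on position. The sketch below assumes the `abstract_inverted_index` cell holds JSON text that `json.loads` can parse and that keywords are separated by plain semicolons; the sample values are made up. The separator used inside `referenced_works` is not documented here, so that column is left alone.

```python
import json


def decode_abstract(inverted_index):
    """Rebuild abstract text from a {word: [positions]} mapping."""
    if not inverted_index:
        return ""
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    return " ".join(word for _, word in sorted(positioned))


def split_keywords(raw):
    """Split a semicolon-separated keyword string, dropping empty entries."""
    return [part.strip() for part in (raw or "").split(";") if part.strip()]


# Made-up cell values in the shapes described in the table above.
cell = '{"Ubiquitin": [0], "tags": [1], "proteins": [2, 5], "and": [3], "degrades": [4]}'
abstract = decode_abstract(json.loads(cell))
keywords = split_keywords("ubiquitin; proteasome ;")
print(abstract)
print(keywords)
```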

`work_authorship.csv` — Authorship

Field	Type	Description
`openalex_work_id`	string	Work identifier
`author_id`	string	Author identifier
`institution_id`	string	Institution identifier
`position`	string	Author position (first, middle, last)

Records: ~1.67 million relationships
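
As an illustration of how the `position` column can be used, the sketch below counts how often one author appears as first, middle, or last author. It assumes, without confirmation from this page, that an author with several affiliations on one paper may appear once per institution, so rows are deduplicated by work and author. The IDs in the sample are made up.

```python
from collections import Counter


def count_positions(rows, author_id):
    """Count one author's papers per author position, one vote per paper."""
    seen = set()
    counts = Counter()
    for row in rows:
        key = (row["openalex_work_id"], row["author_id"])
        if row["author_id"] != author_id or key in seen:
            continue
        seen.add(key)
        counts[row["position"]] += 1
    return counts


# Made-up rows; the second row repeats W1/A1 with a different institution.
rows = [
    {"openalex_work_id": "W1", "author_id": "A1", "institution_id": "I1", "position": "first"},
    {"openalex_work_id": "W1", "author_id": "A1", "institution_id": "I9", "position": "first"},
    {"openalex_work_id": "W1", "author_id": "A2", "institution_id": "I1", "position": "last"},
    {"openalex_work_id": "W2", "author_id": "A1", "institution_id": "I2", "position": "last"},
    {"openalex_work_id": "W3", "author_id": "A1", "institution_id": "I2", "position": "last"},
]
position_counts = count_positions(rows, "A1")
print(position_counts)
```

Because `rows` is consumed in a single pass, a `csv.DictReader` over the full file can be passed in directly without loading all ~1.67 million rows into memory.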

`work_citation_by_year.csv` — Citation Counts

Field	Type	Description
`openalex_work_id`	string	Work identifier
`year`	int	Citation year
`cited_by_count`	int	Number of citations in that year

Records: ~9 million entries
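
Since each row holds the citations received in a single year, a cumulative citation curve has to be computed by the reader. The following sketch builds a per-year timeline with a running total for one work. It assumes the values arrive as strings (as they would from `csv.DictReader`); the sample rows are made up.

```python
from collections import defaultdict
from itertools import accumulate


def citation_timeline(rows, work_id):
    """Return sorted (year, yearly count, running total) tuples for one work."""
    per_year = defaultdict(int)
    for row in rows:
        if row["openalex_work_id"] == work_id:
            per_year[int(row["year"])] += int(row["cited_by_count"])
    years = sorted(per_year)
    counts = [per_year[year] for year in years]
    return list(zip(years, counts, accumulate(counts)))


# Made-up rows in the shape of work_citation_by_year.csv.
rows = [
    {"openalex_work_id": "W1", "year": "2001", "cited_by_count": "5"},
    {"openalex_work_id": "W1", "year": "2000", "cited_by_count": "3"},
    {"openalex_work_id": "W2", "year": "2000", "cited_by_count": "7"},
    {"openalex_work_id": "W1", "year": "2003", "cited_by_count": "2"},
]
timeline = citation_timeline(rows, "W1")
print(timeline)
```

Years with no row are simply absent from the result; whether the file stores explicit zero-citation years is not stated here.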

Other Tables

author.csv — ~297K OpenAlex author IDs
institution.csv — ~13K institutions (name, ROR, geo-location)
source.csv — ~11.5K journals/publishers
laureate_openalex_matching.csv — ~840 laureate ↔ OpenAlex ID matches
laureate_location_info.csv — ~2K laureate geo-information

JSON Data (`data/26963326/json/`)

`publication_records.json` (2.3 GB)

Line-delimited JSON (JSONL) with full metadata for 253K papers, one record per line. A single record, pretty-printed here for readability:

{
  "id": "W2078536640",
  "title": "The ubiquitin-proteasome proteolytic pathway",
  "publication_year": 1998,
  "cited_by_count": 1250,
  "authorships": [...],
  "referenced_works": ["W...", "W..."],
  "abstract_inverted_index": {...},
  "keywords": [...]
}
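
At 2.3 GB the file is better streamed than loaded whole. A minimal sketch, assuming each line is one complete JSON object with the keys shown above; the sample records are made up:

```python
import io
import json


def iter_records(handle):
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def highly_cited(handle, threshold):
    """Collect (id, title) for records at or above a citation threshold."""
    return [
        (record["id"], record.get("title"))
        for record in iter_records(handle)
        if (record.get("cited_by_count") or 0) >= threshold
    ]


# Made-up two-line sample standing in for publication_records.json.
sample = io.StringIO(
    '{"id": "W1", "title": "First made-up paper", "cited_by_count": 1250}\n'
    '\n'
    '{"id": "W2", "title": "Second made-up paper", "cited_by_count": 12}\n'
)
hits = highly_cited(sample, 1000)
print(hits)
```

For the real file, the `io.StringIO` sample would be replaced by `open("data/26963326/json/publication_records.json", encoding="utf-8")`; only one line is held in memory at a time.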

`laureate_info.json` (924 KB)

Laureate details with geographic coordinates for birth, award, and death locations.

`award_details.json` (664 KB)

Award details including geographic coordinates.

External API Sources

OpenAlex API

Purpose: Supplement paper concepts, topics, and field classifications
Endpoint: https://api.openalex.org
Cache: output/openalex_cache/
Documentation: OpenAlex API Docs
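
The cache-then-fetch pattern implied by the cache directory can be sketched as follows. The `/works/{id}` route is part of the public OpenAlex API, but the one-file-per-work cache layout, the function names, and the injectable `fetch` parameter are assumptions made for illustration, not the project's actual implementation. The demonstration uses a stand-in fetcher so no network call is made.

```python
import json
import tempfile
import urllib.request
from pathlib import Path

API_ROOT = "https://api.openalex.org"


def http_get_json(url):
    """Fetch a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def fetch_work(work_id, cache_dir, fetch=http_get_json):
    """Return a work record, reading the on-disk cache before calling the API."""
    cache_path = Path(cache_dir) / f"{work_id}.json"  # hypothetical file layout
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    record = fetch(f"{API_ROOT}/works/{work_id}")
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(record), encoding="utf-8")
    return record


# Offline demonstration: a fake fetcher records which URLs were requested.
calls = []


def fake_fetch(url):
    calls.append(url)
    return {"id": "W2078536640", "topics": []}  # made-up response body


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_work("W2078536640", tmp, fetch=fake_fetch)
    second = fetch_work("W2078536640", tmp, fetch=fake_fetch)

print(len(calls), first == second)
```

The second call is served from disk, so the fetcher runs only once. In real use, `cache_dir` would point at `output/openalex_cache/` and the default `fetch` would be left in place.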

Semantic Scholar API

Purpose: Fetch abstracts and TLDR summaries
Cache: output/openalex_cache/semantic_scholar/
Documentation: S2 API Docs

Unpaywall API

Purpose: Find Open Access PDF URLs
Cache: output/openalex_cache/unpaywall/
Documentation: Unpaywall API Docs
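
Both Semantic Scholar and Unpaywall can be queried by DOI, which `work.csv` provides. The sketch below only builds the request URLs and reads the PDF link out of an Unpaywall response. It assumes lookups are keyed on DOI (this page does not say which key the project uses) and that DOIs may carry an `https://doi.org/` prefix that needs stripping. The helper names and the sample DOI and email are hypothetical.

```python
from urllib.parse import quote, urlencode

DOI_PREFIX = "https://doi.org/"


def bare_doi(doi):
    """Strip a leading https://doi.org/ prefix, if present."""
    return doi[len(DOI_PREFIX):] if doi.lower().startswith(DOI_PREFIX) else doi


def semantic_scholar_url(doi):
    """Build a Graph API paper lookup requesting the abstract and TLDR fields."""
    query = urlencode({"fields": "abstract,tldr"})
    return f"https://api.semanticscholar.org/graph/v1/paper/DOI:{quote(bare_doi(doi))}?{query}"


def unpaywall_url(doi, email):
    """Build an Unpaywall v2 lookup; the API expects a contact email parameter."""
    query = urlencode({"email": email})
    return f"https://api.unpaywall.org/v2/{quote(bare_doi(doi))}?{query}"


def pdf_url(unpaywall_record):
    """Return the best Open Access PDF link from an Unpaywall response, or None."""
    location = unpaywall_record.get("best_oa_location") or {}
    return location.get("url_for_pdf")


s2 = semantic_scholar_url("https://doi.org/10.1000/example")
upw = unpaywall_url("10.1000/example", "me@example.org")
print(s2)
print(upw)
print(pdf_url({"best_oa_location": None}))
```

Responses would then be stored under the cache directories listed above; the cache file naming is left open because it is not documented here.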

Data Coverage

Signal	Coverage	Notes
Paper titles	~100%	Available for nearly all papers
Keywords	~70%	Semicolon-separated keyword strings
Abstracts (inverted index)	~57%	OpenAlex format, can be decoded
Citation network	Most papers	Referenced works available
Year-by-year citations	~9M records	Comprehensive citation tracking
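
Coverage figures like these can be rechecked against `work.csv` with a single pass over the rows. The sketch below assumes that a missing value appears as an empty or whitespace-only cell, which this page does not confirm; the sample rows are made up.

```python
def coverage(rows, column):
    """Fraction of rows whose column holds a non-empty value."""
    total = 0
    filled = 0
    for row in rows:
        total += 1
        if (row.get(column) or "").strip():
            filled += 1
    return filled / total if total else 0.0


# Made-up rows in the shape of work.csv.
rows = [
    {"title": "Paper one", "keywords": "a; b", "abstract_inverted_index": '{"x": [0]}'},
    {"title": "Paper two", "keywords": "", "abstract_inverted_index": ""},
    {"title": "Paper three", "keywords": "c", "abstract_inverted_index": '{"y": [0]}'},
    {"title": "Paper four", "keywords": "d", "abstract_inverted_index": " "},
]
keyword_coverage = coverage(rows, "keywords")
abstract_coverage = coverage(rows, "abstract_inverted_index")
print(keyword_coverage, abstract_coverage)
```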

Known Gaps

Missing Data	Impact	Mitigation
Paper full text	Cannot extract deep concept details	Use abstracts + LLM augmentation
Subject classification	Cannot directly categorize papers	OpenAlex API + LLM inference
Concept ontology	No pre-built hierarchy	Build via LLM + seed concepts
Cross-field annotations	No direct labels	Citation network analysis + LLM
