Data Sources
Overview
The project combines Nobel Prize records with academic publication data, drawing on a local primary dataset and three external APIs (OpenAlex, Semantic Scholar, and Unpaywall).
Primary Dataset (data/26963326/)
The core dataset contains structured data about Nobel Prize laureates and their scientific publications, sourced from OpenAlex and Nobel Prize archives.
CSV Tables (data/26963326/db_data/)
laureate.csv: Nobel Laureates
| Field | Type | Description |
|---|---|---|
| laureate_id | int | Unique laureate identifier |
| name | string | Full name |
| gender | string | Gender |
| nationality | string | Country of nationality |
| education | string | Educational background |
| wikipedia_url | string | Wikipedia page URL |
| wikidata_id | string | Wikidata entity ID |
Records: 757 laureates
award_info.csv: Award Details
| Field | Type | Description |
|---|---|---|
| laureate_id | int | Laureate identifier |
| year | int | Award year |
| category_id | int | Category identifier |
| motivation | string | Award motivation text |
| prize_amount | float | Prize amount |
Records: 761 awards
nobel_prize_category.csv: Prize Categories
| Field | Type | Description |
|---|---|---|
| category_id | int | Category identifier |
| category_name | string | Category name |
Categories: Physics, Chemistry, Physiology/Medicine, Economics
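The three tables above share the `laureate_id` and `category_id` keys, so they can be joined directly. The following is a minimal sketch using only the standard library, assuming each CSV has a header row whose column names match the field tables above; the function name `awards_with_names` is illustrative and not part of the project.

```python
import csv


def awards_with_names(laureate_lines, award_lines, category_lines):
    """Join award rows with laureate names and category names.

    Each argument is an open text file (or any iterable of CSV lines)
    whose header row uses the field names documented above.
    """
    # Keys stay as strings here because csv returns every cell as text.
    names = {
        row["laureate_id"]: row["name"]
        for row in csv.DictReader(laureate_lines)
    }
    categories = {
        row["category_id"]: row["category_name"]
        for row in csv.DictReader(category_lines)
    }
    joined = []
    for row in csv.DictReader(award_lines):
        joined.append(
            {
                "name": names.get(row["laureate_id"]),
                "year": int(row["year"]),
                "category": categories.get(row["category_id"]),
            }
        )
    return joined
```

With the real files, each argument would be a handle opened with `newline=""` and an explicit encoding, for example `open("data/26963326/db_data/laureate.csv", newline="", encoding="utf-8")`.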
work.csv: Publications
| Field | Type | Description |
|---|---|---|
| openalex_work_id | string | OpenAlex work identifier |
| title | string | Paper title |
| keywords | string | Semicolon-separated keywords |
| abstract_inverted_index | JSON | OpenAlex inverted index format |
| referenced_works | string | Cited work IDs |
| doi | string | Digital Object Identifier |
| publication_year | int | Publication year |
Records: ~245,000 papers
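The `abstract_inverted_index` field stores each abstract as a mapping from word to the list of positions where that word occurs, so it must be decoded before use. The sketch below rebuilds the text and also splits the `keywords` string. It assumes the CSV cell holds the index as a JSON string; the helper names `decode_abstract` and `split_keywords` are illustrative.

```python
import json


def decode_abstract(inverted_index):
    """Rebuild plain text from an OpenAlex-style inverted index.

    The index maps each word to the positions where it appears.
    Accepts a dict or its JSON string form; returns "" when empty.
    """
    if isinstance(inverted_index, str):
        text = inverted_index.strip()
        inverted_index = json.loads(text) if text else {}
    if not inverted_index:
        return ""
    # Flatten to (position, word) pairs, then order by position.
    positioned = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    return " ".join(word for _, word in sorted(positioned))


def split_keywords(raw):
    """Split a semicolon-separated keyword string, dropping blanks."""
    return [kw.strip() for kw in (raw or "").split(";") if kw.strip()]
```

A word that occurs several times simply carries several positions, which is why the decoder flattens the mapping before sorting.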
work_authorship.csv: Authorship
| Field | Type | Description |
|---|---|---|
| openalex_work_id | string | Work identifier |
| author_id | string | Author identifier |
| institution_id | string | Institution identifier |
| position | string | Author position (first, middle, last) |
Records: ~1.67 million relationships
work_citation_by_year.csv: Citation Counts
| Field | Type | Description |
|---|---|---|
| openalex_work_id | string | Work identifier |
| year | int | Citation year |
| cited_by_count | int | Number of citations in that year |
Records: ~9 million entries
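Because this table has one row per work per year, a citation trajectory for a paper is obtained by filtering and then accumulating the yearly counts. The sketch below streams rows one at a time instead of loading all ~9 million entries. It assumes rows are dicts keyed by the field names above (as `csv.DictReader` would produce); both function names are illustrative.

```python
from collections import defaultdict


def citations_by_work(rows, work_ids):
    """Collect {work_id: {year: cited_by_count}} for the requested works.

    `rows` is any iterable of dicts with the fields listed above.
    Rows are consumed one at a time, so the full table never needs
    to fit in memory.
    """
    wanted = set(work_ids)
    counts = defaultdict(dict)
    for row in rows:
        work_id = row["openalex_work_id"]
        if work_id in wanted:
            counts[work_id][int(row["year"])] = int(row["cited_by_count"])
    return dict(counts)


def cumulative(yearly):
    """Turn {year: count} into a sorted list of (year, running_total)."""
    total = 0
    out = []
    for year in sorted(yearly):
        total += yearly[year]
        out.append((year, total))
    return out
```

With the real file, `rows` could be a `csv.DictReader` over `data/26963326/db_data/work_citation_by_year.csv`.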
Other Tables
- author.csv: ~297K OpenAlex author IDs
- institution.csv: ~13K institutions (name, ROR, geo-location)
- source.csv: ~11.5K journals/publishers
- laureate_openalex_matching.csv: ~840 laureate ↔ OpenAlex ID matches
- laureate_location_info.csv: ~2K laureate geo-information
JSON Data (data/26963326/json/)
publication_records.json (2.3 GB)
Line-delimited JSON (JSONL) containing full metadata for 253K papers:
```json
{
  "id": "W2078536640",
  "title": "The ubiquitin-proteasome proteolytic pathway",
  "publication_year": 1998,
  "cited_by_count": 1250,
  "authorships": [...],
  "referenced_works": ["W...", "W..."],
  "abstract_inverted_index": {...},
  "keywords": [...]
}
```
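Since the file is line-delimited and 2.3 GB, it is usually better to parse it one line at a time than to load it whole. The following is a minimal sketch of that pattern for records shaped like the example above; `iter_records` and `most_cited` are illustrative names, and the citation threshold is only an example filter.

```python
import json


def iter_records(lines):
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def most_cited(lines, min_citations):
    """Return (id, title) pairs for records at or above a threshold."""
    return [
        (rec["id"], rec.get("title"))
        for rec in iter_records(lines)
        # Treat a missing or null count as zero.
        if (rec.get("cited_by_count") or 0) >= min_citations
    ]
```

With the real file, `lines` would be the handle returned by `open("data/26963326/json/publication_records.json", encoding="utf-8")`, which Python iterates lazily line by line.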
laureate_info.json (924 KB)
Laureate details with geographic coordinates for birth, award, and death locations.
award_details.json (664 KB)
Award details including geographic coordinates.
External API Sources
OpenAlex API
- Purpose: Supplement paper concepts, topics, and field classifications
- Endpoint: https://api.openalex.org
- Cache: output/openalex_cache/
- Documentation: OpenAlex API Docs
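A cache-first lookup against the OpenAlex endpoint could look like the sketch below. The `/works/<id>` path is the public OpenAlex route for a single work, but the one-file-per-work cache layout and the names `get_work` and `http_get_json` are assumptions made for illustration, not the project's actual cache format.

```python
import json
import urllib.request
from pathlib import Path

API_ROOT = "https://api.openalex.org"


def http_get_json(url):
    """Fetch a URL and parse the JSON body (requires network access)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def get_work(work_id, cache_dir="output/openalex_cache", fetch=http_get_json):
    """Return metadata for a work, reading the cache before the API.

    Assumed layout for this sketch: <cache_dir>/<work_id>.json
    """
    cache_file = Path(cache_dir) / f"{work_id}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    data = fetch(f"{API_ROOT}/works/{work_id}")
    # Store the response so later calls skip the network.
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(data), encoding="utf-8")
    return data
```

Passing the fetcher as a parameter keeps the caching logic testable without network access.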
Semantic Scholar API
- Purpose: Fetch abstracts and TLDR summaries
- Cache: output/openalex_cache/semantic_scholar/
- Documentation: S2 API Docs
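As a rough illustration of how abstracts and TLDRs might be requested, the sketch below builds a Semantic Scholar Graph API lookup URL for a DOI and picks a summary from the response. It assumes papers are looked up by DOI and that the response carries `abstract` and a `tldr` object with a `text` key; the helper names are hypothetical, and the project may query the API differently.

```python
from urllib.parse import quote, urlencode

S2_ROOT = "https://api.semanticscholar.org/graph/v1"


def s2_paper_url(doi):
    """Build a lookup URL requesting only the abstract and TLDR fields."""
    query = urlencode({"fields": "abstract,tldr"})
    return f"{S2_ROOT}/paper/DOI:{quote(doi, safe='/')}?{query}"


def best_summary(paper):
    """Prefer the full abstract; fall back to the TLDR text; else None."""
    if paper.get("abstract"):
        return paper["abstract"]
    tldr = paper.get("tldr") or {}
    return tldr.get("text")
```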
Unpaywall API
- Purpose: Find Open Access PDF URLs
- Cache: output/openalex_cache/unpaywall/
- Documentation: Unpaywall API Docs
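Finding a PDF link from an Unpaywall record could be sketched as follows. The code assumes the v2 DOI endpoint (which requires an `email` query parameter) and a response containing `best_oa_location` and `oa_locations` entries with a `url_for_pdf` key; the function names are illustrative only.

```python
from urllib.parse import quote, urlencode


def unpaywall_url(doi, email):
    """Build the DOI lookup URL; Unpaywall asks callers to pass an email."""
    query = urlencode({"email": email})
    return f"https://api.unpaywall.org/v2/{quote(doi, safe='/')}?{query}"


def pdf_url(record):
    """Pick an Open Access PDF URL from an Unpaywall-style record, or None."""
    best = record.get("best_oa_location") or {}
    if best.get("url_for_pdf"):
        return best["url_for_pdf"]
    # Fall back to any other location that exposes a PDF.
    for loc in record.get("oa_locations") or []:
        if loc.get("url_for_pdf"):
            return loc["url_for_pdf"]
    return None
```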
Data Coverage
| Signal | Coverage | Notes |
|---|---|---|
| Paper titles | ~100% | Available for nearly all papers |
| Keywords | ~70% | Semicolon-separated keyword strings |
| Abstracts (inverted index) | ~57% | OpenAlex format, can be decoded |
| Citation network | Most | Referenced works available |
| Year-by-year citations | ~9M records | Comprehensive citation tracking |
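Coverage percentages like those above can be checked by counting the rows in which a field is present and non-blank. The sketch below shows one way to compute such a figure; it assumes rows are dicts of strings (as produced by `csv.DictReader`), and the function name `coverage` is illustrative rather than the method used to produce the table.

```python
def coverage(rows, field):
    """Fraction of rows whose `field` is present and non-blank.

    Returns 0.0 when there are no rows.
    """
    total = filled = 0
    for row in rows:
        total += 1
        if (row.get(field) or "").strip():
            filled += 1
    return filled / total if total else 0.0
```

For example, calling `coverage(rows, "keywords")` over the rows of work.csv would give the share of papers with at least one keyword.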
Known Gaps
| Missing Data | Impact | Mitigation |
|---|---|---|
| Paper full text | Cannot extract deep concept details | Use abstracts + LLM augmentation |
| Subject classification | Cannot directly categorize papers | OpenAlex API + LLM inference |
| Concept ontology | No pre-built hierarchy | Build via LLM + seed concepts |
| Cross-field annotations | No direct labels | Citation network analysis + LLM |