Architecture
System Overview
The Nobel Prize Knowledge Graph is built as a multi-phase data pipeline that transforms raw Nobel Prize and publication data into an interactive, queryable knowledge graph.
flowchart TB
subgraph Input["Input Data"]
CSV[CSV Tables<br>757 laureates, 245K papers]
JSON[JSON Files<br>2.3GB publication records]
end
subgraph Phase1["Phase 1: Data Foundation"]
DL[data_loader.py<br>Load & Clean]
CE[content_enricher.py<br>Enrich Content]
end
subgraph Phase2["Phase 2: Concept Extraction"]
OA[openalex_enricher.py<br>API Enrichment]
LLM[concept_extractor.py<br>LLM Extraction]
end
subgraph Phase34["Phase 3+4: Graph Construction"]
GB[graph_builder.py<br>Build Knowledge Graph]
end
subgraph Phase5["Phase 5: Output"]
VIZ[visualize.py<br>Interactive HTML]
IA[insight_analyzer.py<br>Analysis Reports]
end
CSV --> DL
JSON --> DL
DL --> CE
CE --> OA
OA --> LLM
LLM --> GB
GB --> VIZ
GB --> IA
style Input fill:#fff3e0
style Phase1 fill:#e3f2fd
style Phase2 fill:#f3e5f5
style Phase34 fill:#e8f5e9
style Phase5 fill:#fce4ec
Data Flow
Phase-by-Phase Data Transformation
| Phase | Input | Process | Output |
|---|---|---|---|
| 1 | CSV + JSONL | Load, decode abstracts, clean | laureates.parquet, awards.parquet, publications.parquet, institutions.parquet |
| 1.5 | publications.parquet | Semantic Scholar API, OA PDFs | Updated publications.parquet with enriched abstracts/fulltext |
| 2a | publications.parquet | OpenAlex API | Updated publications.parquet with concepts/topics/fields |
| 2b | publications.parquet + awards.parquet | GPT-4o-mini LLM | llm_extraction_raw.json, concepts.parquet |
| 3+4 | All Parquet + concepts | NetworkX graph construction | knowledge_graph.json, nodes.json, edges.json, .graphml |
| 5 | knowledge_graph.json | Pyvis + Plotly | network.html, timeline.html, heatmap.html |
| 6 | knowledge_graph.json | Graph analysis algorithms | insight_report.md, insight_report.json |
Storage Formats
flowchart LR
RAW[Raw Data] -->|CSV/JSON| PARQUET[Parquet Files]
PARQUET --> KG[Knowledge Graph]
KG -->|JSON| JSON_OUT[knowledge_graph.json]
KG -->|GraphML| GRAPHML[knowledge_graph.graphml]
KG -->|Separate| NODES[nodes.json + edges.json]
KG -->|HTML| VIZ[Interactive Visualizations]
Core Design Decisions
1. Tiered Concept Extraction Strategy
Not all 245K papers receive the same treatment. The system uses a three-tier approach to balance cost and quality:
| Tier | Papers | Method | Cost |
|---|---|---|---|
| Tier 1 | ~1,000 award-related + top-cited | LLM deep extraction | High |
| Tier 2 | ~30,000 highly-cited | Keywords + OpenAlex concepts | Medium |
| Tier 3 | Remaining | Keyword-only coarse tagging | Low |
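The tier split above can be sketched as a simple ranking rule. This is a minimal illustration, not the pipeline's actual logic: the `Paper` fields (`paper_id`, `citation_count`, `is_award_related`), the function name `assign_tiers`, and the exact budgets are assumptions chosen to mirror the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    # Hypothetical record shape; the real publication schema may differ.
    paper_id: str
    citation_count: int
    is_award_related: bool


def assign_tiers(papers, tier1_budget=1_000, tier2_budget=30_000):
    """Map each paper_id to tier 1, 2, or 3.

    Award-related papers are ranked first, then the rest by descending
    citation count. The first `tier1_budget` papers go to tier 1, the
    next `tier2_budget` to tier 2, and everything else to tier 3.
    """
    ranked = sorted(
        papers,
        key=lambda p: (not p.is_award_related, -p.citation_count),
    )
    tiers = {}
    for rank, paper in enumerate(ranked):
        if rank < tier1_budget:
            tiers[paper.paper_id] = 1
        elif rank < tier1_budget + tier2_budget:
            tiers[paper.paper_id] = 2
        else:
            tiers[paper.paper_id] = 3
    return tiers


# Tiny example with budgets scaled down to fit four papers.
papers = [
    Paper("p1", 50, True),
    Paper("p2", 9000, False),
    Paper("p3", 300, False),
    Paper("p4", 2, False),
]
tiers = assign_tiers(papers, tier1_budget=2, tier2_budget=1)
```

With this rule, an award-related paper lands in Tier 1 even when its citation count is low, and the remaining Tier 1 slots are filled by the most-cited papers.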
2. Sample Mode for Development
The pipeline supports a sample mode that processes only 5 representative laureates. This enables rapid iteration without processing the full 2.3GB dataset.
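3. Aggressive Caching
Responses from all external API calls (OpenAlex, Semantic Scholar, Unpaywall) are cached as JSON files under output/openalex_cache/, so repeated runs do not re-request data that has already been fetched: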
output/openalex_cache/ # OpenAlex work data
output/openalex_cache/fulltext/ # Full text from various sources
output/openalex_cache/semantic_scholar/ # S2 API responses
output/openalex_cache/unpaywall/ # Unpaywall OA links
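One way such a file-based cache could work is sketched below. The helper name `cached_fetch`, the hashed file naming, and the fake request function are assumptions for illustration; the project's actual cache keys and file layout inside these directories are not specified here.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cached_fetch(cache_dir, key, fetch):
    """Return the cached JSON value for `key`, calling `fetch()` only on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the key so arbitrary identifiers (DOIs, URLs) become safe file names.
    name = hashlib.sha256(key.encode("utf-8")).hexdigest() + ".json"
    path = cache_dir / name
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch()
    path.write_text(json.dumps(data), encoding="utf-8")
    return data


# Stand-in for a real API request; it records how often it is called.
calls = []


def fake_openalex_request():
    calls.append(1)
    return {"id": "W123", "concepts": ["physics"]}


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "openalex_cache"
    first = cached_fetch(cache_dir, "W123", fake_openalex_request)
    second = cached_fetch(cache_dir, "W123", fake_openalex_request)
```

The second call is served from disk, so the stand-in request function runs only once.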
4. Streaming Large Files
The 2.3 GB publication_records.json file is line-delimited JSON (JSONL). It is read one record at a time, so the full file is never loaded into memory.
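A streaming JSONL reader can be written as a generator, as in the following sketch. The function name `iter_jsonl` and the `paper_id` field in the sample data are hypothetical; the real loader in data_loader.py may handle errors and decoding differently.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def iter_jsonl(path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line, keeping only one line in memory."""
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc


# Small demonstration file with two records and one blank line.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample_records.jsonl"
    sample_path.write_text(
        '{"paper_id": "p1"}\n\n{"paper_id": "p2"}\n', encoding="utf-8"
    )
    paper_ids = [record["paper_id"] for record in iter_jsonl(sample_path)]
```

Because the function is a generator, memory use depends on the size of the largest single record rather than on the size of the file.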
5. NetworkX as Primary Graph Engine
NetworkX was chosen as the graph engine for its:
- Rich algorithm library (shortest paths, centrality, community detection)
- Easy serialization to JSON and GraphML
- No external database dependency
For production scaling, the schema is designed to be compatible with Neo4j.
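The serialization point can be illustrated with a small round trip through both formats. This sketch assumes NetworkX is installed; the node IDs, the `kind` and `label` attributes, and the `relation` edge attribute are placeholders, not the project's actual schema.

```python
import json
import tempfile
from pathlib import Path

import networkx as nx

# Build a two-node graph with hypothetical attribute names.
graph = nx.DiGraph()
graph.add_node("laureate:1", kind="laureate", label="Example Laureate")
graph.add_node("concept:1", kind="concept", label="Example Concept")
graph.add_edge("laureate:1", "concept:1", relation="CONTRIBUTED_TO")

with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "knowledge_graph.json"
    graphml_path = Path(tmp) / "knowledge_graph.graphml"

    # JSON: node-link format is a plain dict that json can dump directly.
    json_path.write_text(json.dumps(nx.node_link_data(graph)), encoding="utf-8")

    # GraphML: written by NetworkX's built-in writer.
    nx.write_graphml(graph, str(graphml_path))

    # Read both files back to confirm that nothing was lost.
    restored = nx.node_link_graph(
        json.loads(json_path.read_text(encoding="utf-8"))
    )
    from_graphml = nx.read_graphml(str(graphml_path))
```

Keeping node and edge attributes to simple strings and numbers, as here, is what makes both exports straightforward; nested values would need flattening before a GraphML export.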
Module Dependency Graph
flowchart TD
INIT[src/__init__.py<br>Config Loader] --> DL
INIT --> CE
INIT --> OA
INIT --> CX
INIT --> GB
INIT --> VIZ
INIT --> IA
DL[data_loader.py] --> CE[content_enricher.py]
CE --> OA[openalex_enricher.py]
OA --> CX[concept_extractor.py]
DL --> GB[graph_builder.py]
CX --> GB
GB --> VIZ[visualize.py]
GB --> IA[insight_analyzer.py]
MAIN[main.py<br>Pipeline Runner] -.->|orchestrates| DL
MAIN -.-> CE
MAIN -.-> OA
MAIN -.-> CX
MAIN -.-> GB
MAIN -.-> VIZ
MAIN -.-> IA
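The solid edges in the diagram above imply a valid execution order for the pipeline stages. The sketch below derives one with the standard library's `graphlib`; the `DEPENDS_ON` mapping is transcribed from the diagram (omitting the config loader and the `main.py` orchestration edges), and it is not how `main.py` is known to schedule the stages.

```python
from graphlib import TopologicalSorter

# Each module maps to the set of modules that must run before it.
DEPENDS_ON = {
    "data_loader": set(),
    "content_enricher": {"data_loader"},
    "openalex_enricher": {"content_enricher"},
    "concept_extractor": {"openalex_enricher"},
    "graph_builder": {"data_loader", "concept_extractor"},
    "visualize": {"graph_builder"},
    "insight_analyzer": {"graph_builder"},
}

# static_order() yields every module after all of its prerequisites.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
```

The two output modules, `visualize` and `insight_analyzer`, depend only on `graph_builder`, so their relative order is not fixed by the graph.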
Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| Data Processing | Polars / Pandas | CSV/JSON loading, cleaning, transformation |
| NLP / Extraction | OpenAI API (GPT-4o-mini) | Technical concept extraction from abstracts |
| Data Enrichment | OpenAlex REST API | Paper concepts, topics, field classification |
| Content Enrichment | Semantic Scholar API, Unpaywall | Abstract and full-text retrieval |
| PDF Processing | PyMuPDF | PDF text extraction |
| Graph Engine | NetworkX | In-memory graph operations and algorithms |
| Visualization | Pyvis, Plotly | Interactive HTML visualizations |
| Serialization | JSON, GraphML, Parquet | Data persistence |
| Environment | uv | Python package and environment management |
| Configuration | YAML + dotenv | Settings and secrets management |