Pipeline Overview
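The Nobel Prize Knowledge Graph is built by a 6-phase pipeline in which each phase handles one stage of the data transformation. Some phases are split or merged in practice: Phase 1.5 is an intermediate enrichment step, Phase 2 runs as 2a and 2b, and Phases 3 and 4 run together.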
Pipeline Phases
flowchart LR
P1[Phase 1<br>Data Loading] --> P15[Phase 1.5<br>Content Enrichment]
P15 --> P2a[Phase 2a<br>OpenAlex Enrichment]
P2a --> P2b[Phase 2b<br>LLM Extraction]
P2b --> P34[Phase 3+4<br>Graph Construction]
P34 --> P5[Phase 5<br>Visualization]
P34 --> P6[Phase 6<br>Insight Analysis]
| Phase | Module | Time Est. | Description |
|---|---|---|---|
| 1 | data_loader.py | ~2 days | Load CSV/JSON, decode abstracts, clean and output Parquet |
| 1.5 | content_enricher.py | ~1 day | Enrich papers via Semantic Scholar, OA PDFs |
| 2a | openalex_enricher.py | ~1 day | Add concepts/topics/fields from OpenAlex API |
| 2b | concept_extractor.py | ~2 days | Extract structured concepts via GPT-4o-mini |
| 3+4 | graph_builder.py | ~2 days | Build NetworkX directed graph, export JSON/GraphML |
| 5 | visualize.py | ~1 day | Generate 3 interactive HTML visualizations |
| 6 | insight_analyzer.py | ~1 day | Analyze graph, generate Markdown insight report |
Running the Pipeline
Full Run
Phase Selection
uv run python main.py --phase 1 # Data loading only
uv run python main.py --phase 1.5 # Content enrichment only
uv run python main.py --phase 2 # Concept extraction (2a + 2b)
uv run python main.py --phase 3 # Graph construction
uv run python main.py --phase 5 # Visualization
uv run python main.py --phase 6 # Insight analysis
Skip Options
uv run python main.py --skip-llm # Skip LLM extraction (Phase 2b)
uv run python main.py --skip-enrich # Skip content enrichment (Phase 1.5)
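To make the flag semantics above concrete, here is a minimal sketch of how `--phase`, `--skip-llm`, and `--skip-enrich` could resolve to a list of sub-phases. It is not taken from `main.py`: the names `PHASE_GROUPS`, `ALL_PHASES`, `build_parser`, and `select_phases` are hypothetical, and the real entry point may combine the flags differently.

```python
import argparse

# Hypothetical mapping from a --phase value to the sub-phases it runs.
# The grouping (2 -> 2a + 2b, 3 -> 3+4) follows the commands listed above.
PHASE_GROUPS = {
    "1": ["1"],
    "1.5": ["1.5"],
    "2": ["2a", "2b"],
    "3": ["3+4"],
    "5": ["5"],
    "6": ["6"],
}
ALL_PHASES = ["1", "1.5", "2a", "2b", "3+4", "5", "6"]


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser; the real main.py may differ."""
    parser = argparse.ArgumentParser(description="Illustrative phase selector")
    parser.add_argument("--phase", choices=sorted(PHASE_GROUPS), default=None)
    parser.add_argument("--skip-llm", action="store_true")
    parser.add_argument("--skip-enrich", action="store_true")
    return parser


def select_phases(argv: list[str]) -> list[str]:
    """Resolve command-line arguments to the ordered sub-phases to run."""
    args = build_parser().parse_args(argv)
    # No --phase means a full run over every sub-phase.
    phases = PHASE_GROUPS[args.phase] if args.phase else list(ALL_PHASES)
    if args.skip_llm:
        phases = [p for p in phases if p != "2b"]  # drop LLM extraction
    if args.skip_enrich:
        phases = [p for p in phases if p != "1.5"]  # drop content enrichment
    return phases
```

Under these assumptions, `select_phases(["--phase", "2", "--skip-llm"])` returns `["2a"]`, that is, OpenAlex enrichment without the LLM step.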
Data Dependencies
Each phase depends on outputs from previous phases:
flowchart TD
CSV[CSV Files] --> P1
JSONL[publication_records.json] --> P1
P1[Phase 1] -->|laureates.parquet<br>awards.parquet<br>publications.parquet<br>institutions.parquet| P15[Phase 1.5]
P15 -->|publications.parquet<br>enriched| P2a[Phase 2a]
P2a -->|publications.parquet<br>with concepts| P2b[Phase 2b]
P2b -->|concepts.parquet<br>llm_extraction_raw.json| P34[Phase 3+4]
P1 -->|All Parquet files| P34
P34 -->|knowledge_graph.json| P5[Phase 5]
P34 -->|knowledge_graph.json| P6[Phase 6]
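The dependency flowchart can also be written down as data, which is handy for working out what must already have run before a single phase is invoked on its own. The sketch below transcribes the edges of the flowchart into a dictionary and orders them with the standard-library `graphlib` module (Python 3.9+). The names `DEPENDS_ON` and `upstream` are hypothetical and not part of the project.

```python
from graphlib import TopologicalSorter

# Each phase maps to the set of phases whose outputs it reads,
# transcribed from the dependency flowchart above.
DEPENDS_ON = {
    "1": set(),
    "1.5": {"1"},
    "2a": {"1.5"},
    "2b": {"2a"},
    "3+4": {"1", "2b"},
    "5": {"3+4"},
    "6": {"3+4"},
}


def upstream(phase: str) -> set[str]:
    """Return every phase that must have run before `phase`."""
    seen: set[str] = set()
    stack = [phase]
    while stack:
        for dep in DEPENDS_ON[stack.pop()]:
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


# A valid execution order: every phase appears after its dependencies.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
```

Phases 5 and 6 both depend only on Phase 3+4, so either may come first in `order`; nothing in the flowchart forces one before the other.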
Output Summary
| Phase | Output Files | Location |
|---|---|---|
| 1 | laureates.parquet, awards.parquet, publications.parquet, institutions.parquet | output/clean_data/ |
| 1.5 | Updated publications.parquet | output/clean_data/ |
| 2a | Updated publications.parquet | output/clean_data/ |
| 2b | llm_extraction_raw.json, concepts.parquet | output/concepts/ |
| 3+4 | knowledge_graph.json, nodes.json, edges.json, knowledge_graph.graphml | output/graph/ |
| 5 | network.html, timeline.html, heatmap.html | output/viz/ |
| 6 | insight_report.md, insight_report.json | output/reports/ |
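As a quick sanity check after a run, the table above can be turned into a small script that reports which expected files are absent. This is a sketch only: `EXPECTED_OUTPUTS` and `missing_outputs` are hypothetical names, the paths are copied from the table, and Phases 1.5 and 2a are left out because they update `publications.parquet` in place rather than creating new files.

```python
from pathlib import Path

# Phase -> (folder, file names), copied from the output summary table.
# Phases 1.5 and 2a are omitted: they rewrite publications.parquet in place.
EXPECTED_OUTPUTS = {
    "1": ("output/clean_data", [
        "laureates.parquet", "awards.parquet",
        "publications.parquet", "institutions.parquet",
    ]),
    "2b": ("output/concepts", ["llm_extraction_raw.json", "concepts.parquet"]),
    "3+4": ("output/graph", [
        "knowledge_graph.json", "nodes.json",
        "edges.json", "knowledge_graph.graphml",
    ]),
    "5": ("output/viz", ["network.html", "timeline.html", "heatmap.html"]),
    "6": ("output/reports", ["insight_report.md", "insight_report.json"]),
}


def missing_outputs(root: Path) -> dict[str, list[str]]:
    """Map each phase to the expected files not found under `root`."""
    missing: dict[str, list[str]] = {}
    for phase, (folder, names) in EXPECTED_OUTPUTS.items():
        absent = [n for n in names if not (root / folder / n).exists()]
        if absent:
            missing[phase] = absent
    return missing


if __name__ == "__main__":
    # Assumes the script is run from the project root.
    for phase, files in missing_outputs(Path(".")).items():
        print(f"Phase {phase}: missing {', '.join(files)}")
```

The check only tests for file existence; it says nothing about whether the contents are complete or up to date.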