Getting Started¶
This guide walks you through setting up the Nobel Prize Knowledge Graph project and running the pipeline.
Prerequisites¶
| Requirement | Version | Purpose |
|---|---|---|
| Python | >= 3.12 | Runtime |
| uv | Latest | Package management |
| OpenAI API Key | — | LLM concept extraction (Phase 2b) |
Installation¶
1. Clone the Repository¶
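A sketch of the clone step; the repository URL and directory name below are placeholders, not the project's actual values:

```bash
# Replace <repository-url> with the project's Git URL
git clone <repository-url>
cd <project-directory>
```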
2. Install Dependencies¶
```bash
uv add polars pandas pyarrow networkx pyvis plotly requests openai pyyaml python-dotenv tqdm scikit-learn pymupdf
```
3. Configure Environment¶
Create a .env file in the project root:
```
OPENAI_API_KEY=sk-your-api-key-here
OPENAI_BASE_URL=https://api.openai.com/v1  # Optional: custom endpoint
```
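The pipeline reads these values via python-dotenv. A minimal sanity check you can adapt; `check_openai_config` is a hypothetical helper, not part of the project's code:

```python
import os

def check_openai_config() -> bool:
    """Return True if the minimum OpenAI setting is present in the environment."""
    return bool(os.environ.get("OPENAI_API_KEY"))

if __name__ == "__main__":
    if check_openai_config():
        print("OPENAI_API_KEY found")
    else:
        print("OPENAI_API_KEY missing - check your .env file")
```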
4. Verify Data¶
Ensure the data directory contains the required files:
```bash
ls data/26963326/db_data/
# Expected: author.csv, award_info.csv, institution.csv, laureate.csv, ...
ls data/26963326/json/
# Expected: award_details.json, laureate_info.json, publication_records.json
```
Running the Pipeline¶
Full Pipeline¶
This runs all phases sequentially:
- Phase 1 — Data Loading & Cleaning
- Phase 1.5 — Content Enrichment (Semantic Scholar + OA PDFs)
- Phase 2a — OpenAlex Concept Enrichment
- Phase 2b — LLM Concept Extraction
- Phase 3+4 — Knowledge Graph Construction
- Phase 5 — Visualization
- Phase 6 — Insight Analysis
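The full run itself is launched with no phase flag, assuming the same `main.py` entry point used by the phase-specific commands:

```bash
uv run python main.py
```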
Run Specific Phases¶
```bash
# Run only data loading
uv run python main.py --phase 1

# Run only graph construction
uv run python main.py --phase 3

# Run only visualization
uv run python main.py --phase 5
```
Skip Optional Steps¶
```bash
# Skip LLM extraction (saves API costs)
uv run python main.py --skip-llm

# Skip content enrichment
uv run python main.py --skip-enrich
```
Run Individual Modules¶
Each module can also be run standalone:
```bash
uv run python -m src.data_loader
uv run python -m src.openalex_enricher
uv run python -m src.concept_extractor
uv run python -m src.graph_builder
uv run python -m src.visualize
uv run python -m src.insight_analyzer
```
Viewing Outputs¶
Visualizations¶
Start a local HTTP server to view the interactive HTML visualizations:
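One way to do this with the standard library's HTTP server; `output/visualizations` is an assumed output path, so adjust it to where the HTML files are actually written:

```bash
# Serve the generated HTML on port 8765
uv run python -m http.server 8765 --directory output/visualizations
```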
Then open in your browser:
- Network Graph: http://localhost:8765/network.html
- Timeline: http://localhost:8765/timeline.html
- Heatmap: http://localhost:8765/heatmap.html
Reports¶
The insight report is generated as a Markdown file.
Knowledge Graph¶
The graph data is stored in multiple formats:
```bash
# Full graph (JSON)
output/graph/knowledge_graph.json

# Separate node/edge files
output/graph/nodes.json
output/graph/edges.json

# GraphML format (for Gephi, Cytoscape, etc.)
output/graph/knowledge_graph.graphml

# Simplified concept-only graph
output/graph/concept_graph_simplified.json
output/graph/concept_graph_simplified_nodes.json
output/graph/concept_graph_simplified_edges.json
output/graph/concept_graph_simplified.graphml
```
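The GraphML files can be loaded back with networkx for further analysis. A self-contained sketch of the roundtrip, written against a temporary file; the node IDs and attribute names here are illustrative, not the project's actual schema (the real file lives at `output/graph/knowledge_graph.graphml`):

```python
import tempfile
import networkx as nx

# Build a tiny graph in the same spirit as the knowledge graph;
# node/edge attributes survive a GraphML roundtrip as strings.
g = nx.Graph()
g.add_node("laureate:114", label="Abdus Salam", kind="laureate")
g.add_node("concept:qft", label="Quantum field theory", kind="concept")
g.add_edge("laureate:114", "concept:qft", relation="associated_with")

with tempfile.NamedTemporaryFile(suffix=".graphml") as f:
    nx.write_graphml(g, f.name)
    loaded = nx.read_graphml(f.name)

print(loaded.number_of_nodes(), loaded.number_of_edges())  # 2 1
```

Tools like Gephi and Cytoscape read the same files directly, which is why the pipeline exports GraphML alongside JSON.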
Sample Mode¶
By default, the pipeline runs in sample mode using 5 representative laureates spanning three fields (Physics, Chemistry, and Economics):
| ID | Name | Field | Year |
|---|---|---|---|
| 745 | A. Michael Spence | Economics | 2001 |
| 102 | Aage N. Bohr | Physics | 1975 |
| 779 | Aaron Ciechanover | Chemistry | 2004 |
| 114 | Abdus Salam | Physics | 1979 |
| 843 | Ada E. Yonath | Chemistry | 2009 |
This produces a manageable graph (~97 nodes, ~181 edges) suitable for development and testing.
Troubleshooting¶
Common Issues¶
**Missing API Key**

If you see `OpenAI API key not found`, ensure your `.env` file contains a valid `OPENAI_API_KEY`.

**Large JSON File**

The `publication_records.json` file is 2.3 GB. The pipeline uses streaming/sample loading to keep memory usage reasonable.

**Rate Limits**

The OpenAlex and Semantic Scholar APIs enforce rate limits. The pipeline includes automatic rate limiting and caching to handle this gracefully.
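The pattern looks roughly like the sketch below: fixed-interval throttling plus an in-memory cache so repeated requests skip the network entirely. This is an illustration of the technique, not the pipeline's actual implementation; `fetch` is a hypothetical stand-in for a real HTTP call:

```python
import time
from functools import lru_cache

MIN_INTERVAL = 0.1  # seconds between calls; real APIs publish their own limits
_last_call = 0.0

def throttled(fn):
    """Delay calls so successive invocations are at least MIN_INTERVAL apart."""
    def wrapper(*args, **kwargs):
        global _last_call
        wait = MIN_INTERVAL - (time.monotonic() - _last_call)
        if wait > 0:
            time.sleep(wait)
        _last_call = time.monotonic()
        return fn(*args, **kwargs)
    return wrapper

@lru_cache(maxsize=None)   # cache sits outside the throttle: repeat calls are free
@throttled
def fetch(url: str) -> str:
    # Placeholder for a real HTTP request (e.g. requests.get(url))
    return f"response for {url}"

results = [
    fetch("https://api.openalex.org/works"),  # throttled network call
    fetch("https://api.openalex.org/works"),  # served from cache, no wait
]
```

Ordering the decorators this way means only cache misses pay the throttling delay.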
Next Steps¶
- Read the Architecture guide to understand the system design
- Explore the Pipeline documentation for each phase
- Check the Configuration reference for customization options