Getting Started

This guide walks you through setting up the Nobel Prize Knowledge Graph project and running the pipeline.

Prerequisites

| Requirement | Version | Purpose |
| --- | --- | --- |
| Python | >= 3.12 | Runtime |
| uv | Latest | Package management |
| OpenAI API Key | n/a | LLM concept extraction (Phase 2b) |

Installation

1. Clone the Repository

git clone https://github.com/example/nobel-kg.git
cd nobel-kg

2. Install Dependencies

uv add polars pandas pyarrow networkx pyvis plotly requests openai pyyaml python-dotenv tqdm scikit-learn pymupdf

3. Configure Environment

Create a .env file in the project root:

OPENAI_API_KEY=sk-your-api-key-here
# Optional: override to use a custom endpoint
OPENAI_BASE_URL=https://api.openai.com/v1
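The pipeline presumably loads these variables with python-dotenv. As a rough illustration of the expected file format, here is a minimal stdlib-only reader (a stand-in for the project's actual loading code, not a copy of it; it handles only full-line comments, not quoting):

```python
import os

def load_env(path: str = ".env") -> dict[str, str]:
    """Minimal .env reader: KEY=VALUE lines; blank lines and
    full-line '#' comments are ignored. No quote handling."""
    values = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values
```

In practice you would merge the result into `os.environ` (or just use `dotenv.load_dotenv()`), so that `os.environ["OPENAI_API_KEY"]` is available to the OpenAI client.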

4. Verify Data

Ensure the data directory contains the required files:

ls data/26963326/db_data/
# Expected: author.csv, award_info.csv, institution.csv, laureate.csv, ...

ls data/26963326/json/
# Expected: award_details.json, laureate_info.json, publication_records.json
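If you prefer to script this check, a small helper can report what is absent (the file lists mirror the expected names above; the helper itself is hypothetical, not part of the pipeline):

```python
from pathlib import Path

# Required data files, as listed in the verification step above.
REQUIRED_CSVS = ["author.csv", "award_info.csv", "institution.csv", "laureate.csv"]
REQUIRED_JSON = ["award_details.json", "laureate_info.json", "publication_records.json"]

def missing_files(base: str) -> list[str]:
    """Return paths of required data files absent under `base`
    (e.g. data/26963326)."""
    root = Path(base)
    expected = [root / "db_data" / n for n in REQUIRED_CSVS] + \
               [root / "json" / n for n in REQUIRED_JSON]
    return [str(p) for p in expected if not p.exists()]
```

An empty return value means the data directory is complete.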

Running the Pipeline

Full Pipeline

uv run python main.py

This runs all pipeline phases sequentially:

  1. Phase 1 — Data Loading & Cleaning
  2. Phase 1.5 — Content Enrichment (Semantic Scholar + OA PDFs)
  3. Phase 2a — OpenAlex Concept Enrichment
  4. Phase 2b — LLM Concept Extraction
  5. Phase 3+4 — Knowledge Graph Construction
  6. Phase 5 — Visualization
  7. Phase 6 — Insight Analysis

Run Specific Phases

# Run only data loading
uv run python main.py --phase 1

# Run only graph construction
uv run python main.py --phase 3

# Run only visualization
uv run python main.py --phase 5

Skip Optional Steps

# Skip LLM extraction (saves API costs)
uv run python main.py --skip-llm

# Skip content enrichment
uv run python main.py --skip-enrich
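The flags above suggest a CLI along these lines (a hypothetical argparse sketch; main.py's actual parser may define these options differently):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the command-line interface implied by the examples above."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--phase",
                        help="run a single phase (e.g. 1, 3, 5)")
    parser.add_argument("--skip-llm", action="store_true",
                        help="skip Phase 2b LLM extraction (saves API costs)")
    parser.add_argument("--skip-enrich", action="store_true",
                        help="skip Phase 1.5 content enrichment")
    return parser
```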

Run Individual Modules

Each module can also be run standalone:

uv run python -m src.data_loader
uv run python -m src.openalex_enricher
uv run python -m src.concept_extractor
uv run python -m src.graph_builder
uv run python -m src.visualize
uv run python -m src.insight_analyzer
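Running with `python -m` works because each module presumably guards its own entry point. The standard pattern looks like this (a sketch only, not the actual contents of any src/ module):

```python
def main() -> int:
    # ... module-specific work (load data, enrich, build graph, etc.) ...
    return 0  # exit status: 0 on success

if __name__ == "__main__":
    # Executed only when run via `python -m src.<module>` or as a script,
    # never on plain import.
    main()
```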

Viewing Outputs

Visualizations

Start a local HTTP server to view the interactive HTML visualizations:

python3 -m http.server 8765 --directory output/viz

Then open in your browser:

Reports

The insight report is generated as Markdown:

cat output/reports/insight_report.md

Knowledge Graph

The graph data is stored in multiple formats:

# Full graph (JSON)
output/graph/knowledge_graph.json

# Separate node/edge files
output/graph/nodes.json
output/graph/edges.json

# GraphML format (for Gephi, Cytoscape, etc.)
output/graph/knowledge_graph.graphml

# Simplified concept-only graph
output/graph/concept_graph_simplified.json
output/graph/concept_graph_simplified_nodes.json
output/graph/concept_graph_simplified_edges.json
output/graph/concept_graph_simplified.graphml
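A minimal way to load and sanity-check the separate node/edge files (the `id`/`source`/`target` field names here are assumptions for illustration; inspect output/graph/nodes.json to confirm the real schema):

```python
import json

def load_graph(nodes_path: str, edges_path: str):
    """Load node and edge lists and flag edges whose endpoints
    reference unknown node IDs."""
    with open(nodes_path) as f:
        nodes = json.load(f)
    with open(edges_path) as f:
        edges = json.load(f)
    index = {n["id"]: n for n in nodes}
    # Integrity check: every edge endpoint should be a known node.
    dangling = [e for e in edges
                if e["source"] not in index or e["target"] not in index]
    return index, edges, dangling
```

For the GraphML files, `networkx.read_graphml(...)` (already a project dependency) is the usual route.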

Sample Mode

By default, the pipeline runs in sample mode using 5 representative laureates across 3 fields:

| ID | Name | Field | Year |
| --- | --- | --- | --- |
| 745 | A. Michael Spence | Economics | 2001 |
| 102 | Aage N. Bohr | Physics | 1975 |
| 779 | Aaron Ciechanover | Chemistry | 2004 |
| 114 | Abdus Salam | Physics | 1979 |
| 843 | Ada E. Yonath | Chemistry | 2009 |

This produces a manageable graph (~97 nodes, ~181 edges) suitable for development and testing.
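Sample-mode filtering might be sketched like this (the laureate IDs come from the table above, but the CSV column name is an assumption; check data/26963326/db_data/laureate.csv for the real header):

```python
import csv
import io

# IDs of the five sample laureates listed above.
SAMPLE_IDS = {"745", "102", "779", "114", "843"}

def sample_rows(csv_text: str, id_column: str = "laureate_id") -> list[dict]:
    """Keep only rows belonging to the sample-mode laureates."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[id_column] in SAMPLE_IDS]
```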

Troubleshooting

Common Issues

Missing API Key

If you see OpenAI API key not found, ensure your .env file contains a valid OPENAI_API_KEY.

Large JSON File

The publication_records.json file is 2.3 GB. The pipeline uses streaming/sample loading to keep memory usage reasonable.
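One stdlib-only way to iterate a huge top-level JSON array incrementally (a sketch of the general technique, not necessarily the pipeline's actual strategy):

```python
import json

def iter_records(fp, chunk_size=65536):
    """Yield objects one at a time from a file containing a top-level
    JSON array, without materializing the whole array in memory."""
    decoder = json.JSONDecoder()
    buf = fp.read(chunk_size).lstrip()
    if not buf.startswith("["):
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]
    while True:
        buf = buf.lstrip().lstrip(",").lstrip()
        if buf.startswith("]"):
            return                      # end of array
        if not buf:
            more = fp.read(chunk_size)  # chunk boundary fell between objects
            if not more:
                return                  # truncated input: stop quietly
            buf = more
            continue
        try:
            obj, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            more = fp.read(chunk_size)  # object split across chunks
            if not more:
                return
            buf += more
            continue
        yield obj
        buf = buf[end:]
```

Libraries such as ijson offer the same idea with less code, if adding a dependency is acceptable.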

Rate Limits

OpenAlex API and Semantic Scholar API have rate limits. The pipeline includes automatic rate limiting and caching to handle this gracefully.
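A simple in-memory version of the throttle-plus-cache idea (the pipeline's real mechanism, e.g. on-disk caching, may differ):

```python
import time
from functools import wraps

def throttled_cached(min_interval: float):
    """Decorator sketch: memoize results per-argument and space out
    real calls by at least `min_interval` seconds."""
    def decorate(fn):
        cache = {}
        last_call = [0.0]
        @wraps(fn)
        def wrapper(*args):
            if args in cache:
                return cache[args]      # cache hit: no API call, no wait
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)        # respect the rate limit
            last_call[0] = time.monotonic()
            cache[args] = fn(*args)
            return cache[args]
        return wrapper
    return decorate
```

Wrapping the API-fetch functions this way means repeated lookups for the same work or author never hit the network twice.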

Next Steps

  • Read the Architecture guide to understand the system design
  • Explore the Pipeline documentation for each phase
  • Check the Configuration reference for customization options